From Multiarmed Bandits to Stochastic Optimization


1 From Multiarmed Bandits to Stochastic Optimization. Multiarmed Bandits Workshop, Rotterdam, NL, May 24, 2018. Warren B. Powell, Princeton University, Department of Operations Research and Financial Engineering. 2018 Warren B. Powell, Princeton University

2 The multiarmed bandit story

3 Materials science
» Optimizing payloads: reactive species, biomolecules, fluorescent markers, ...
» Controllers for robotic scientist for materials science experiments
» Optimizing nanoparticles to maximize photoconductivity

4 Learning problems: Health sciences
» Sequential design of experiments for drug discovery
» Drug delivery: optimizing the design of protective membranes to control drug release
» Medical decision making: optimal learning for medical treatments.

5 Drug discovery: optimizing the configuration of molecules. Design of effective policies can accelerate the search process for new drugs.

6 Optimal learning in diabetes
How do we find the best treatment for diabetes?
» The standard treatment is a medication called metformin, which works for about 70 percent of patients.
» What do we do when metformin does not work for a patient?
» There are about 20 other treatments, and it is a process of trial and error. Doctors need to get through this process as quickly as possible.

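7 Truckload brokerages
Now we have a logistic curve for each origin-destination pair $(i,j)$:
$$P^Y(p, a \mid \theta) = \frac{e^{\theta^0_{ij} + \theta^p_{ij} p + \theta^a_{ij} a}}{1 + e^{\theta^0_{ij} + \theta^p_{ij} p + \theta^a_{ij} a}}$$
[Figure: shipper and carrier response curves versus offered price]
The number of offers for each $(i,j)$ pair is relatively small, so we need to generalize the learning across traffic lanes.
The slides that follow are from the senior thesis of Connor Werth (2017).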

8 Ad-click optimization
Optimizing bids for internet ads
» In partnership with Roomsage.com
» Developed Princeton ad-click game
» Teams compete to find best policy
[Figure: clicks and profits versus bid ($/click)]

9 Emergency storm response
The power grid
» Loss of power creates cascading failures (lack of fuel, inability to pump water)
» How to plan?
» How to react?
Hurricane Sandy
» Once in 100 years?
» Rare convergence of events
» But, meteorologists did an amazing job of forecasting the storm.

10 Emergency storm response [Figure: grid network annotated with "call" and "X" markers and numeric values]

11 Emergency storm response [Figure: grid network annotated with "call" and "X" markers and numeric values]

12 Emergency storm response [Figure: grid network annotated with "call" and "X" markers and numeric values]

13 The bandit vocabulary

14 The bandit vocabulary

15 Arms

16 and bandits

17 Multiarmed bandit problems
What is a bandit problem?
» The literature seems to characterize a bandit problem as any problem where a policy has to balance exploration vs. exploitation.
» But this means that a bandit problem is defined by how it is solved. E.g., if you use a pure exploration policy, is it a bandit problem?
My definition:
» Any sequential decision problem which involves learning, and where we have direct or indirect control over the information that is collected.

18 Multiarmed bandit problems
Dimensions of a bandit problem:
» The arms (decisions) may be: binary (A/B testing, stopping problems); discrete alternatives (drug, catalyst, ...); continuous choices (price); vector-valued (basketball team, products, movies, ...); multiattribute (attributes of a movie, song, person); static vs. dynamic choice sets; sequential vs. batch
» Information (what we observe): success-failure/discrete outcome; exponential family (e.g. Gaussian, exponential, ...); heavy-tailed (e.g. Cauchy); data-driven (distribution unknown); stationary vs. nonstationary processes; lagged responses? adversarial?

19 Multiarmed bandit problems
Dimensions of a bandit problem:
» Belief models: lookup tables (these are most common), with independent or correlated beliefs; parametric models, linear or nonlinear in the parameters; nonparametric models (locally linear, deep neural networks/SVM); Bayesian vs. frequentist
» Objective function: expected performance (e.g. regret); offline (final reward) vs. online (cumulative reward), i.e. just interested in the final design, or optimizing while learning?; risk metrics

20 Outline: Elements of a sequential decision model; Mixed state problems; Designing policies; Searching for the best policy

21 Outline: Elements of a sequential decision model; Mixed state problems; Designing policies; Searching for the best policy

22 Modeling
Any sequential decision problem consists of five core elements:
» State variables
» Decision variables
» Exogenous information
» Transition function
» Objective function

23 Modeling dynamic problems
The state variable:
Controls community: $x_t$ = "information state"
Operations research/MDP/computer science: $S_t = (R_t, I_t, B_t)$ = system state, where:
$R_t$ = resource state (physical state): location/status of truck/train/plane; energy in storage
$I_t$ = information state: prices; weather
$B_t$ = belief state ("state of knowledge"): belief about traffic delays; belief about the status of equipment

24 Modeling dynamic problems
The state variable:
» The initial state contains: all deterministic parameters; initial values of dynamic parameters; prior distribution of belief about unknown parameters
» The dynamic state contains all information that changes over time: physical state; information state; belief state (Bayesian updating):
$$\bar\mu^{n+1}_x = \frac{\beta^n_x \bar\mu^n_x + \beta^W W^{n+1}_x}{\beta^n_x + \beta^W}, \qquad \beta^{n+1}_x = \beta^n_x + \beta^W$$
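The Bayesian updating step on this slide can be sketched in a few lines. This is a minimal illustration that assumes a normally distributed belief and normally distributed observation noise, with $\beta$ read as a precision (inverse variance); the function and argument names are hypothetical and not from the talk.

```python
def update_belief(mu, beta, w, beta_w):
    """Precision-weighted update of a normal belief about one alternative.

    mu, beta : current mean and precision (1 / variance) of the belief
    w        : new observation W^{n+1}_x
    beta_w   : precision of the observation noise (assumed known)
    """
    beta_new = beta + beta_w
    mu_new = (beta * mu + beta_w * w) / beta_new
    return mu_new, beta_new


# Prior mean 10 with variance 4, one observation of 14 with noise variance 4.
mu, beta = update_belief(10.0, 1.0 / 4.0, w=14.0, beta_w=1.0 / 4.0)
```

With equal prior and observation precisions the updated mean lands halfway between the prior mean and the observation, and the precision doubles.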

25 Modeling dynamic problems
Decisions:
Markov decision processes/computer science: $a_t$ = discrete action
Control theory: $u_t$ = low-dimensional continuous vector
Operations research: $x_t$ = usually a discrete or continuous but high-dimensional vector of decisions
At this point, we do not specify how to make a decision. Instead, we define the function $X^\pi(s)$ (or $A^\pi(s)$ or $U^\pi(s)$), where $\pi$ specifies the type of policy. "$\pi$" carries information about the type of function $f$, and any tunable parameters $\theta \in \Theta^f$.

26 The decision variables
Styles of decisions
» Binary: $x \in \mathcal X = \{0, 1\}$
» Finite: $x \in \mathcal X = \{1, 2, \dots, M\}$ (classic bandit model)
» Continuous scalar: $x \in \mathcal X = [a, b]$
» Continuous vector: $x = (x_1, \dots, x_K)$, $x_k \in \mathbb R$
» Discrete vector: $x = (x_1, \dots, x_K)$, $x_k \in \mathbb Z$
» Categorical: $x = (a_1, \dots, a_I)$, where $a_i$ is a category (e.g. patient attributes)

27 Modeling dynamic problems
Exogenous information:
$W_t$ = new information that first became known at time $t$ = $(\hat R_t, \hat D_t, \hat p_t, \hat E_t)$
$\hat R_t$ = equipment failures, delays, new arrivals; new drivers being hired to the network
$\hat D_t$ = new customer demands
$\hat p_t$ = changes in prices
$\hat E_t$ = information about the environment (temperature, ...)
Note: Any variable indexed by $t$ is known at time $t$. This convention, which is not standard in control theory, dramatically simplifies the modeling of information.
Below, we let $\omega$ represent a sequence of actual observations $W_1, W_2, \dots$; $W_t(\omega)$ refers to a sample realization of the random variable $W_t$.

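28 Modeling dynamic problems
The transition function:
$$S_{t+1} = S^M(S_t, x_t, W_{t+1})$$
$R_{t+1} = R_t + x_t + \hat R_{t+1}$ (inventories)
$p_{t+1} = p_t + \hat p_{t+1}$ (spot prices)
$D_{t+1} = D_t + \hat D_{t+1}$ (market demands)
Bayesian updating of belief:
$$\bar\mu^{n+1}_x = \frac{\beta^n_x \bar\mu^n_x + \beta^W W^{n+1}_x}{\beta^n_x + \beta^W}, \qquad \beta^{n+1}_x = \beta^n_x + \beta^W$$
Also known as the: system model, state transition model, plant model, plant equation, transition law, transfer function, transformation function, law of motion, model.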

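29 Modeling stochastic, dynamic problems
The universal objective function
» Cumulative reward (classical bandit objective):
$$\max_\pi \mathbb E\left\{ \sum_{t=0}^{T} C\big(S_t, X^\pi_t(S_t), W_{t+1}\big) \,\Big|\, S_0 \right\}$$
» Final reward ("best arm" bandit objective):
$$\max_\pi \mathbb E\left\{ F(x^{\pi,N}, \hat W) \,\big|\, S_0 \right\}$$
Given a system model (transition function) $S_{t+1} = S^M(S_t, x_t, W_{t+1})$ and a stochastic process $(S_0, W_1, W_2, \dots, W_T)$.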
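The cumulative-reward objective can be estimated for any fixed policy by simulating the five model elements. The sketch below is one way to do that; the inventory example, its prices, and all function names are hypothetical stand-ins chosen for illustration, not taken from the talk.

```python
import random


def simulate_policy(policy, transition, contribution, sample_w,
                    s0, horizon, n_paths, seed=0):
    """Average cumulative contribution of `policy` over simulated sample paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        s = s0
        for _t in range(horizon + 1):           # t = 0, ..., T
            x = policy(s)                        # x_t = X^pi(S_t)
            w = sample_w(rng)                    # W_{t+1}
            total += contribution(s, x, w)       # C(S_t, x_t, W_{t+1})
            s = transition(s, x, w)              # S_{t+1} = S^M(S_t, x_t, W_{t+1})
    return total / n_paths


# Hypothetical inventory problem used only to exercise the loop.
PRICE, COST = 10.0, 2.0


def order_up_to_5(inventory):
    return max(0, 5 - inventory)


def inventory_transition(inventory, order, demand):
    return max(0, inventory + order - demand)


def inventory_contribution(inventory, order, demand):
    return PRICE * min(inventory + order, demand) - COST * order


def uniform_demand(rng):
    return rng.randint(0, 6)


estimate = simulate_policy(order_up_to_5, inventory_transition,
                           inventory_contribution, uniform_demand,
                           s0=0, horizon=20, n_paths=500)
```

Comparing `estimate` across candidate policies is the simulation counterpart of the outer maximization over $\pi$.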

30 [Figure: the fields that study decisions under uncertainty] Stochastic programming; approximate dynamic programming; robust optimization; decision analysis; simulation optimization; optimal learning; dynamic programming and control; model predictive control; stochastic search; optimal control; bandit problems; online computation; stochastic control; reinforcement learning; Markov decision processes

31 Outline: Elements of a sequential decision model; Mixed state problems; Designing policies; Searching for the best policy

32 Modeling dynamic problems
Some major problem classes
» Pure physical state: inventory problems; stochastic shortest path problems
» Physical plus information: inventory with exogenous prices, weather, ...
» Pure belief states: these are classical bandit problems; different types of belief models
» Belief plus information: patient arriving to doctor's office who then prescribes a drug; contextual bandit problems
» Everything: revenue management; clinical trials

33 Modeling dynamic problems
Mixed state problems (physical and belief state)
» Clinical trials: learning the performance of a new drug (belief state); tracking the number of patients signed up (physical state)
» Revenue management for hotels: learning market response to price (belief state); tracking how many rooms have been reserved (physical state)
» An energy storage problem

34 An energy storage problem
Consider a basic energy storage problem:
» We have to manage the flows of energy (blue lines) while managing different sources of uncertainty.

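35 An energy storage problem
Transition function without learning [Figure: network with nodes E, B, L, G]
$$E_{t+1} = E_t + \hat E_{t+1}$$
$$p_{t+1} = \theta_0 p_t + \theta_1 p_{t-1} + \theta_2 p_{t-2} + \varepsilon^p_{t+1}$$
$$D_{t+1} = f^D_{t,t+1} + \varepsilon^D_{t+1}$$
$$R^{battery}_{t+1} = R^{battery}_t + x_t$$

36 An energy storage problem
Transition function with passive learning [Figure: network with nodes E, B, L]
$$E_{t+1} = E_t + \hat E_{t+1}$$
$$p_{t+1} = \bar\theta_{t0} p_t + \bar\theta_{t1} p_{t-1} + \bar\theta_{t2} p_{t-2} + \varepsilon^p_{t+1}$$
$$D_{t+1} = f^D_{t,t+1} + \varepsilon^D_{t+1}$$
$$R^{battery}_{t+1} = R^{battery}_t + x_t$$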

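37 Learning in stochastic optimization
Updating the demand parameter
» Let $p_{t+1}$ be the new price and let $\bar F_t(x_t \mid \bar\theta_t) = \bar\theta_{t0} p_t + \bar\theta_{t1} p_{t-1} + \bar\theta_{t2} p_{t-2}$
» We update our estimate $\bar\theta_t$ using our recursive least squares equations:
$$\bar\theta_{t+1} = \bar\theta_t - \frac{1}{\gamma_{t+1}} B_t \phi_t \varepsilon_{t+1}$$
$$\varepsilon_{t+1} = \bar F_t(x_t \mid \bar\theta_t) - p_{t+1}$$
$$B_{t+1} = B_t - \frac{1}{\gamma_{t+1}} \big(B_t \phi_t (\phi_t)^T B_t\big)$$
$$\gamma_{t+1} = 1 + (\phi_t)^T B_t \phi_t$$
$$\phi_t = \begin{pmatrix} p_t \\ p_{t-1} \\ p_{t-2} \end{pmatrix}$$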
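The recursive least squares equations above translate directly into code. The sketch below uses plain Python lists and assumes that $B_t$ is symmetric, which holds when it is initialized as a multiple of the identity. The synthetic regressors, the "true" coefficients, and the initial scale `1e4` are illustrative assumptions only.

```python
import random


def rls_update(theta, B, phi, p_next):
    """One recursive least squares step.

    theta  : current coefficient estimates (list of length k)
    B      : current k x k matrix (list of lists, assumed symmetric)
    phi    : regressor vector, e.g. the last three prices
    p_next : newly observed price p_{t+1}
    """
    k = len(theta)
    B_phi = [sum(B[i][j] * phi[j] for j in range(k)) for i in range(k)]
    gamma = 1.0 + sum(phi[i] * B_phi[i] for i in range(k))
    # Error = prediction minus observation, matching the sign on the slide.
    eps = sum(theta[i] * phi[i] for i in range(k)) - p_next
    theta_new = [theta[i] - B_phi[i] * eps / gamma for i in range(k)]
    B_new = [[B[i][j] - B_phi[i] * B_phi[j] / gamma for j in range(k)]
             for i in range(k)]
    return theta_new, B_new


# Synthetic, noise-free data generated from assumed "true" coefficients.
rng = random.Random(0)
true_theta = [0.7, 0.2, 0.1]
theta_hat = [0.0, 0.0, 0.0]
B = [[1e4 if i == j else 0.0 for j in range(3)] for i in range(3)]
for _ in range(50):
    phi = [rng.uniform(20.0, 40.0) for _ in range(3)]
    p_next = sum(a * b for a, b in zip(true_theta, phi))
    theta_hat, B = rls_update(theta_hat, B, phi, p_next)
```

Because each step only needs the previous estimate and the matrix $B_t$, the pair $(\bar\theta_t, B_t)$ is exactly the belief state that the passive and active learning models carry forward.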

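38 An energy storage problem
Transition function with active learning [Figure: network with nodes E, B, L]
$$E_{t+1} = E_t + \hat E_{t+1}$$
$$p_{t+1} = \bar\theta_{t0} p_t + \bar\theta_{t1} p_{t-1} + \bar\theta_{t2} p_{t-2} + \bar\theta_{t3} x^{GB}_t + \varepsilon^p_{t+1}$$
$$D_{t+1} = f^D_{t,t+1} + \varepsilon^D_{t+1}$$
$$R^{battery}_{t+1} = R^{battery}_t + x_t$$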

39 Outline: Elements of a sequential decision model; Mixed state problems; Designing policies; Searching for the best policy

40 Designing policies
We have to start by describing what we mean by a policy.
» Definition: A policy is a mapping from a state to an action ... any mapping.
How do we search over an arbitrary space of policies?

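41 Designing policies
Two fundamental strategies:
1) Policy search: search over a class of functions for making decisions to optimize some metric.
$$\max_{\pi = (f \in \mathcal F,\ \theta^f \in \Theta^f)} \mathbb E\left\{ \sum_{t=0}^{T} C\big(S_t, X^\pi_t(S_t \mid \theta^f)\big) \,\Big|\, S_0 \right\}$$
2) Lookahead approximations: approximate the impact of a decision now on the future.
$$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb E\left\{ \max_\pi \mathbb E\left\{ \sum_{t'=t+1}^{T} C\big(S_{t'}, X^\pi_{t'}(S_{t'})\big) \,\Big|\, S_{t+1} \right\} \Big|\, S_t, x_t \right\} \right)$$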

42 Designing policies
Policy search:
1a) Policy function approximations (PFAs)
» Lookup tables: "when in this state, take this action"
» Parametric functions: order-up-to policies (if inventory is less than $s$, order up to $S$); affine policies, $X^{PFA}(S_t \mid \theta) = \sum_{f \in \mathcal F} \theta_f \phi_f(S_t)$; neural networks
» Locally/semi/non parametric: requires optimizing over local regions
1b) Cost function approximations (CFAs)
» Optimizing a deterministic model modified to handle uncertainty (buffer stocks, schedule slack): $X^{CFA}(S_t \mid \theta) = \arg\max_{x_t \in \mathcal X^\pi_t(\theta)} \bar C^\pi(S_t, x_t \mid \theta)$
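The order-up-to rule is about the simplest parametric PFA there is: two tunable parameters and no embedded optimization. A minimal sketch, with hypothetical function and argument names:

```python
def order_up_to(inventory, s, S):
    """Order-up-to PFA: if inventory is below s, order enough to reach S.

    s and S are the tunable parameters (theta) of the policy.
    """
    if inventory < s:
        return S - inventory
    return 0


orders = [order_up_to(r, s=5, S=20) for r in (0, 3, 5, 12)]
```

The policy itself is trivial to compute; the work lies in choosing `s` and `S`, which is the tuning problem that policy search addresses.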

43-45 Designing policies
Lookahead policies
2a) Value function approximations
We approximate the impact of a decision on the future,
$$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb E\left\{ \max_\pi \mathbb E\left\{ \sum_{t'=t+1}^{T} C\big(S_{t'}, X^\pi_{t'}(S_{t'})\big) \,\Big|\, S_{t+1} \right\} \Big|\, S_t, x_t \right\} \right)$$
by approximating the value of being in a downstream state using machine learning ("value function approximations"):
$$X^*_t(S_t) = \arg\max_{x_t} \Big( C(S_t, x_t) + \mathbb E\big\{ V_{t+1}(S_{t+1}) \mid S_t, x_t \big\} \Big)$$
$$X^{VFA}_t(S_t) = \arg\max_{x_t} \Big( C(S_t, x_t) + \mathbb E\big\{ \bar V_{t+1}(S_{t+1}) \mid S_t, x_t \big\} \Big) = \arg\max_{x_t} \Big( C(S_t, x_t) + \bar V^x_t(S^x_t) \Big)$$

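46 Designing policies
2b) Direct lookahead policies
$$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb E\left\{ \max_\pi \mathbb E\left\{ \sum_{t'=t+1}^{T} C\big(S_{t'}, X^\pi_{t'}(S_{t'})\big) \,\Big|\, S_{t+1} \right\} \Big|\, S_t, x_t \right\} \right)$$

47 Designing policies
2b) Direct lookahead policies
» We replace the exact lookahead
$$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb E\left\{ \max_\pi \mathbb E\left\{ \sum_{t'=t+1}^{T} C\big(S_{t'}, X^\pi_{t'}(S_{t'})\big) \,\Big|\, S_{t+1} \right\} \Big|\, S_t, x_t \right\} \right)$$
with an approximation called the lookahead model:
$$X^*_t(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \tilde{\mathbb E}\left\{ \max_{\tilde\pi} \tilde{\mathbb E}\left\{ \sum_{t'=t+1}^{t+H} C\big(\tilde S_{tt'}, \tilde X^{\tilde\pi}(\tilde S_{tt'})\big) \,\Big|\, \tilde S_{t,t+1} \right\} \Big|\, S_t, x_t \right\} \right)$$
» A lookahead policy works by approximating the lookahead model.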

48 Designing policies
Types of lookahead approximations
» One-step lookahead: widely used in pure learning policies (Bayes greedy/naïve Bayes; Thompson sampling; value of information, i.e. the knowledge gradient)
» Multi-step lookahead: deterministic lookahead, also known as model predictive control or rolling horizon procedure; stochastic lookahead, either two-stage (widely used in stochastic linear programming) or multistage
» Monte Carlo tree search (MCTS) for discrete action spaces
» Multistage scenario trees (stochastic linear programming), typically not tractable.

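49 Four (meta)classes of policies
Policy search:
1) Policy function approximations (PFAs)
» Lookup tables, rules, parametric/nonparametric functions
2) Cost function approximation (CFAs)
» $X^{CFA}(S_t \mid \theta) = \arg\max_{x_t \in \mathcal X^\pi_t(\theta)} \bar C^\pi(S_t, x_t \mid \theta)$
Lookahead approximations:
3) Policies based on value function approximations (VFAs)
» $X^{VFA}_t(S_t) = \arg\max_{x_t} \big( C(S_t, x_t) + \bar V^x_t(S^x_t(S_t, x_t)) \big)$
4) Direct lookahead policies (DLAs)
» Deterministic lookahead/rolling horizon proc./model predictive control: $X^{LA-D}_t(S_t) = \arg\max_{\tilde x_{tt}, \dots, \tilde x_{t,t+H}} \Big( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{t'=t+1}^{T} C(\tilde S_{tt'}, \tilde x_{tt'}) \Big)$
» Chance constrained programming: $P[A_t x_t \le f(W)] \le 1 - \delta$
» Stochastic lookahead/stochastic prog./Monte Carlo tree search: $X^{LA-S}_t(S_t) = \arg\max_{\tilde x_{tt}, \tilde x_{t,t+1}, \dots, \tilde x_{t,T}} \Big( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{\tilde\omega \in \tilde\Omega_t} p(\tilde\omega) \sum_{t'=t+1}^{T} C(\tilde S_{tt'}(\tilde\omega), \tilde x_{tt'}(\tilde\omega)) \Big)$
» "Robust optimization": $X^{LA-RO}_t(S_t) = \arg\max_{\tilde x_{tt}, \dots, \tilde x_{t,t+H}} \min_{w \in \mathcal W_t(\theta)} \Big( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{t'=t+1}^{T} C(\tilde S_{tt'}(w), \tilde x_{tt'}(w)) \Big)$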

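50 Four (meta)classes of policies (Function approx.)
1) Policy function approximations (PFAs)
» Lookup tables, rules, parametric/nonparametric functions
2) Cost function approximation (CFAs)
» $X^{CFA}(S_t \mid \theta) = \arg\max_{x_t \in \mathcal X^\pi_t(\theta)} \bar C^\pi(S_t, x_t \mid \theta)$
3) Policies based on value function approximations (VFAs)
» $X^{VFA}_t(S_t) = \arg\max_{x_t} \big( C(S_t, x_t) + \bar V^x_t(S^x_t(S_t, x_t)) \big)$
4) Direct lookahead policies (DLAs)
» Deterministic lookahead/rolling horizon proc./model predictive control: $X^{LA-D}_t(S_t) = \arg\max_{\tilde x_{tt}, \dots, \tilde x_{t,t+H}} \Big( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{t'=t+1}^{T} C(\tilde S_{tt'}, \tilde x_{tt'}) \Big)$
» Chance constrained programming: $P[A_t x_t \le f(W)] \le 1 - \delta$
» Stochastic lookahead/stochastic prog./Monte Carlo tree search: $X^{LA-S}_t(S_t) = \arg\max_{\tilde x_{tt}, \tilde x_{t,t+1}, \dots, \tilde x_{t,T}} \Big( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{\tilde\omega \in \tilde\Omega_t} p(\tilde\omega) \sum_{t'=t+1}^{T} C(\tilde S_{tt'}(\tilde\omega), \tilde x_{tt'}(\tilde\omega)) \Big)$
» "Robust optimization": $X^{LA-RO}_t(S_t) = \arg\max_{\tilde x_{tt}, \dots, \tilde x_{t,t+H}} \min_{w \in \mathcal W_t(\theta)} \Big( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{t'=t+1}^{T} C(\tilde S_{tt'}(w), \tilde x_{tt'}(w)) \Big)$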

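51 Four (meta)classes of policies (Imbedded optimization)
1) Policy function approximations (PFAs)
» Lookup tables, rules, parametric/nonparametric functions
2) Cost function approximation (CFAs)
» $X^{CFA}(S_t \mid \theta) = \arg\max_{x_t \in \mathcal X^\pi_t(\theta)} \bar C^\pi(S_t, x_t \mid \theta)$
3) Policies based on value function approximations (VFAs)
» $X^{VFA}_t(S_t) = \arg\max_{x_t} \big( C(S_t, x_t) + \bar V^x_t(S^x_t(S_t, x_t)) \big)$
4) Direct lookahead policies (DLAs)
» Deterministic lookahead/rolling horizon proc./model predictive control: $X^{LA-D}_t(S_t) = \arg\max_{\tilde x_{tt}, \dots, \tilde x_{t,t+H}} \Big( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{t'=t+1}^{T} C(\tilde S_{tt'}, \tilde x_{tt'}) \Big)$
» Chance constrained programming: $P[A_t x_t \le f(W)] \le 1 - \delta$
» Stochastic lookahead/stochastic prog./Monte Carlo tree search: $X^{LA-S}_t(S_t) = \arg\max_{\tilde x_{tt}, \tilde x_{t,t+1}, \dots, \tilde x_{t,T}} \Big( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{\tilde\omega \in \tilde\Omega_t} p(\tilde\omega) \sum_{t'=t+1}^{T} C(\tilde S_{tt'}(\tilde\omega), \tilde x_{tt'}(\tilde\omega)) \Big)$
» "Robust optimization": $X^{LA-RO}_t(S_t) = \arg\max_{\tilde x_{tt}, \dots, \tilde x_{t,t+H}} \min_{w \in \mathcal W_t(\theta)} \Big( C(\tilde S_{tt}, \tilde x_{tt}) + \sum_{t'=t+1}^{T} C(\tilde S_{tt'}(w), \tilde x_{tt'}(w)) \Big)$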

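52 Policies for pure learning problems
1) Policy function approximation (PFA)
» Revenue maximization problem
Demand function: $D(p^n \mid \theta^n) = \theta^n_1 - \theta^n_2 p^n$
Revenue: $R(p^n \mid \theta^n) = p^n D(p^n) = \theta^n_1 p^n - \theta^n_2 (p^n)^2$
PFA policy, pure exploitation: $p^n = \dfrac{\theta^n_1}{2\theta^n_2}$
PFA policy with active exploration ("excitation policy"): $p^n = \dfrac{\theta^n_1}{2\theta^n_2} + \varepsilon^n$, $\varepsilon^n \sim N(0, \sigma^2_\varepsilon)$. Need to tune $\sigma_\varepsilon$.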
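Both pricing rules on this slide are one-liners once the demand estimates are available. The following sketch assumes the linear demand model above; the function names and the example numbers are hypothetical.

```python
import random


def revenue(p, theta1, theta2):
    """Revenue p * D(p) under the linear demand model D(p) = theta1 - theta2 * p."""
    return theta1 * p - theta2 * p * p


def price_exploit(theta1, theta2):
    """Pure exploitation: the price that maximizes estimated revenue."""
    return theta1 / (2.0 * theta2)


def price_excite(theta1, theta2, sigma_eps, rng):
    """Excitation policy: the exploitation price plus zero-mean Gaussian noise.

    sigma_eps is the tunable parameter of the policy.
    """
    return price_exploit(theta1, theta2) + rng.gauss(0.0, sigma_eps)


rng = random.Random(0)
p_star = price_exploit(100.0, 2.0)
p_explore = price_excite(100.0, 2.0, sigma_eps=1.0, rng=rng)
```

The noise term deliberately varies the price so that later regression updates of the demand parameters see informative data; without it, the price can settle early and the estimates stop improving.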

53 Policies for pure learning problems
1) Policy function approximation (PFA)
» Linear decision rules ("affine policies"): $X^{PFA}(S^n \mid \theta) = \theta_0 + \theta_1 \phi_1(S^n) + \theta_2 \phi_2(S^n) + \dots + \theta_F \phi_F(S^n)$
» Neural networks

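54 Policies for pure learning problems
2) Cost function approximations (CFA)
» Upper confidence bounding: $X^{UCB}(S^n \mid \theta^{UCB}) = \arg\max_x \left( \bar\mu^n_x + \theta^{UCB} \sqrt{\dfrac{\log n}{N^n_x}} \right)$
» Interval estimation: $X^{IE}(S^n \mid \theta^{IE}) = \arg\max_x \left( \bar\mu^n_x + \theta^{IE} \bar\sigma^n_x \right)$
» Boltzmann exploration ("soft max"): choose $x$ with probability $P^n(x) = \dfrac{e^{\theta \bar\mu^n_x}}{\sum_{x'} e^{\theta \bar\mu^n_{x'}}}$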
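The three CFA policies differ only in the score they attach to each alternative. A minimal sketch follows, assuming lookup-table beliefs stored as plain lists; the function names and sample numbers are hypothetical.

```python
import math


def ucb_choice(mu, counts, n, theta):
    """Upper confidence bounding. Requires n >= 1 and every count >= 1."""
    scores = [m + theta * math.sqrt(math.log(n) / c) for m, c in zip(mu, counts)]
    return max(range(len(mu)), key=scores.__getitem__)


def ie_choice(mu, sigma, theta):
    """Interval estimation: mean plus theta standard deviations of the belief."""
    scores = [m + theta * s for m, s in zip(mu, sigma)]
    return max(range(len(mu)), key=scores.__getitem__)


def boltzmann_probs(mu, theta):
    """Soft-max sampling probabilities (shifted by the max for numerical stability)."""
    m_max = max(mu)
    weights = [math.exp(theta * (m - m_max)) for m in mu]
    total = sum(weights)
    return [w / total for w in weights]


mu = [1.0, 1.2, 0.8]
sigma = [0.5, 0.1, 1.0]
greedy = ie_choice(mu, sigma, theta=0.0)        # picks the highest mean
optimistic = ie_choice(mu, sigma, theta=2.0)    # rewards uncertainty
probs = boltzmann_probs(mu, theta=1.0)
```

In each case $\theta$ controls how much weight uncertainty gets relative to the current estimate, which is why these policies only perform well after tuning.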

55 Policies for pure learning problems
A learning problem with correlated beliefs

56 Policies for pure learning problems
Picking $\theta^{IE} = 0$ means we are evaluating each choice at the mean.

57 Policies for pure learning problems
Picking a suitably positive $\theta^{IE}$ means we are evaluating each choice at the 95th percentile

58 Policies for pure learning problems
PFAs and CFAs have to be tuned
» Final reward ("offline learning"): $\max_{\theta^{IE}} \mathbb E_{W^1, \dots, W^N} \mathbb E_{\hat W}\, F\big(x^{\pi,N}(\theta^{IE}), \hat W\big)$
» Cumulative reward ("online learning"): $\max_{\theta^{IE}} \mathbb E\left\{ \sum_{t=0}^{T} C\big(S_t, X^{IE}(S_t \mid \theta^{IE}), W_{t+1}\big) \,\Big|\, S_0 \right\}$
» Both require searching over tunable parameters. Offline tuning is classical stochastic search. Online tuning is a relatively open research area.
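Offline tuning of $\theta^{IE}$ can be sketched as a grid search over simulated learning runs. The simulator below is an assumption made for illustration: truths drawn from a standard normal prior, independent normal beliefs, and Gaussian observation noise. All names are hypothetical.

```python
import math
import random


def run_ie(theta, n_arms, budget, noise_sd, rng):
    """One simulated run of interval estimation; returns the final reward."""
    truth = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]   # sampled from the prior
    mu = [0.0] * n_arms
    beta = [1.0] * n_arms                                   # prior precision
    beta_w = 1.0 / noise_sd ** 2
    for _ in range(budget):
        scores = [mu[x] + theta / math.sqrt(beta[x]) for x in range(n_arms)]
        x = max(range(n_arms), key=scores.__getitem__)
        w = rng.gauss(truth[x], noise_sd)
        mu[x] = (beta[x] * mu[x] + beta_w * w) / (beta[x] + beta_w)
        beta[x] += beta_w
    best = max(range(n_arms), key=mu.__getitem__)
    return truth[best]              # true value of the design we would implement


def tune_ie(thetas, n_reps=200, seed=0, **kw):
    """Grid search over theta, reusing the same random seed for every theta."""
    results = {}
    for theta in thetas:
        rng = random.Random(seed)
        results[theta] = sum(run_ie(theta, rng=rng, **kw)
                             for _ in range(n_reps)) / n_reps
    return max(results, key=results.get), results


best_theta, avg_reward = tune_ie([0.0, 0.5, 1.0, 2.0, 3.0],
                                 n_arms=5, budget=20, noise_sd=2.0)
```

Reseeding for each $\theta$ is a deliberate choice: every candidate faces the same sequence of sampled truths, which reduces the noise in the comparison. Switching the returned quantity from the final design's value to the sum of observed rewards would give the cumulative-reward version.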

59 Cost function approximations
Tuning the interval estimation policy: $X^{IE}(S^n \mid \theta^{IE}) = \arg\max_x \left( \bar\mu^n_x + \theta^{IE} \bar\sigma^n_x \right)$

60 Policies for pure learning problems
3) Policies based on value function approximations
» VFAs using a physical state problem: $V^n(S^n) = \max_x \big( C(S^n, x) + \mathbb E\{ V^{n+1}(S^{n+1}) \mid S^n \} \big)$
[Figure: network showing the current node (e.g. node 2)]

61 Policies for pure learning problems
3) Policies based on value function approximations
» VFAs using a physical state problem: $V^n(S^n) = \max_x \big( C(S^n, x) + \mathbb E\{ V^{n+1}(S^{n+1}) \mid S^n \} \big)$
[Figure: network showing the decision to go to a node (e.g. 5) and the downstream node (e.g. 5)]

62 Policies for pure learning problems
3) Policies based on value function approximations
» VFAs using a learning problem, with $S^n = (S^n_1, \dots, S^n_5)$: $V^n(S^n) = \max_x \big( C(S^n, x) + \mathbb E\{ V^{n+1}(S^{n+1}) \mid S^n \} \big)$
[Figure: current state of knowledge, decision to make a measurement, new state of knowledge]

63 Policies for pure learning problems
3) Policies based on value function approximations
» Illustration: finding the best drug in the set
» After a test we observe success or failure: if $x^n = x$, then $W^{n+1}_x = 1$ (success) or $W^{n+1}_x = 0$ (failure)
» Let $\rho_x$ = probability that drug $x$ is successful. We assume that $\rho_x \mid S^n \sim \mathrm{Beta}(\alpha^n_x, \beta^n_x)$, where $S^n = (\alpha^n_x, \beta^n_x)_x$ is our belief state, with updating equations:
$$\alpha^{n+1}_x = \alpha^n_x + W^{n+1}_x, \qquad \beta^{n+1}_x = \beta^n_x + (1 - W^{n+1}_x)$$
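The beta-Bernoulli updating equations amount to counting successes and failures. A minimal sketch, with hypothetical function names and a uniform Beta(1, 1) prior assumed for the example:

```python
def update_beta(alpha, beta, w):
    """Update a Beta(alpha, beta) belief after outcome w (1 = success, 0 = failure)."""
    return alpha + w, beta + (1 - w)


def success_mean(alpha, beta):
    """Posterior mean of the success probability."""
    return alpha / (alpha + beta)


alpha, beta = 1, 1                    # uniform prior on the success probability
for outcome in (1, 1, 0):             # two successes, then one failure
    alpha, beta = update_beta(alpha, beta, outcome)
estimate = success_mean(alpha, beta)
```

Only the tested drug's pair $(\alpha_x, \beta_x)$ changes after each experiment, so the belief state for a set of drugs is just one such pair per drug.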

64 Policies for pure learning problems
3) Policies based on value function approximations
» Bellman's equation: $V^n(\alpha^n, \beta^n) = \max_x \mathbb E\left[ W^{n+1}_x + \gamma V^{n+1}\big(\alpha^n + W^{n+1}_x,\ \beta^n + 1 - W^{n+1}_x\big) \,\Big|\, S^n \right]$
» This can be solved for a stopping problem to determine when to stop testing a single drug.
» Problematic if $\alpha$ and $\beta$ are vectors. Gittins developed a novel decomposition that allows us to solve this problem for one drug ("arm") at a time.

65 Policies for pure learning problems
3) Policies based on value function approximations
» For normally distributed rewards, Gittins (1974) showed that we can solve dynamic programs for each alternative.
» Produces a policy that looks like $X^{Gitt}(S^n) = \arg\max_x \left( \bar\mu^n_x + \sigma_W\, G\!\left( \dfrac{\bar\sigma^n_x}{\sigma_W}, \gamma \right) \right)$, where $G\!\left( \dfrac{\bar\sigma^n_x}{\sigma_W}, \gamma \right)$ is the Gittins index obtained by solving a dynamic program for whether to continue or stop testing a single drug.
» Considered a computational breakthrough, but computing Gittins indices is still a challenge, and the approach only applies to special cases.

66 Policies for pure learning problems
4) Policies based on direct lookaheads (DLA)
» The knowledge gradient for offline (final reward):
$$\nu^{KG,n}_x = \mathbb E\left\{ \max_y F\big(y, B^{n+1}(x)\big) \,\Big|\, S^n \right\} - \max_y F(y, B^n)$$
$x$: proposed experiment
$B^{n+1}(x)$: updated parameter estimates after running experiment with density $x$
$\max_y F(y, B^{n+1}(x))$: finding the new design with our new belief (but without knowing the outcome of the experiment)
$\mathbb E$: averaging over the possible outcomes of the experiment (and our different beliefs about parameters)
$S^n$: current belief state
$\max_y F(y, B^n)$: choosing the best design given what we know now
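For the special case of a lookup-table belief model with independent normal beliefs and Gaussian measurement noise, the expectation in the knowledge gradient has a well-known closed form. The sketch below covers only that special case, and it assumes $F(y, B)$ is the estimated mean of alternative $y$. The function names are hypothetical.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def knowledge_gradient(mu, sigma, noise_sd):
    """Knowledge gradient of each alternative under independent normal beliefs.

    mu, sigma : means and standard deviations of the current beliefs
    noise_sd  : standard deviation of a single measurement
    Needs at least two alternatives.
    """
    kg = []
    for x in range(len(mu)):
        # Std. dev. of the change in the estimate of x caused by one measurement.
        sigma_tilde = sigma[x] ** 2 / math.sqrt(sigma[x] ** 2 + noise_sd ** 2)
        if sigma_tilde == 0.0:
            kg.append(0.0)          # nothing left to learn about x
            continue
        best_other = max(m for i, m in enumerate(mu) if i != x)
        zeta = -abs(mu[x] - best_other) / sigma_tilde
        kg.append(sigma_tilde * (zeta * normal_cdf(zeta) + normal_pdf(zeta)))
    return kg


kg_values = knowledge_gradient(mu=[1.0, 1.2, 0.8], sigma=[0.5, 0.1, 1.0], noise_sd=1.0)
next_experiment = max(range(len(kg_values)), key=kg_values.__getitem__)
```

The index is large when an alternative is both uncertain and close enough to the current best that one measurement could change which design is chosen. Correlated beliefs and parametric belief models need different calculations that are not shown here.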

67 The knowledge gradient
4) Policies based on direct lookaheads (DLA)
» The knowledge gradient computes the expected improvement from a single experiment
[Figure: change in estimate of value of option 5 due to measurement; change which produces a change in the decision]

68 The knowledge gradient [Figure: estimated value of alternative, standard deviation, knowledge gradient]

69 The knowledge gradient

70 The knowledge gradient

71 The knowledge gradient
Some properties of the knowledge gradient
For offline (final reward) problems:
» Asymptotically optimal (finds best x in the limit)
» Optimal (by construction) if budget = 1.
» Optimal for all n if number of alternatives = 2 (e.g. A/B testing).
» Only stationary policy that is both myopically and asymptotically optimal.
For online problems:
» Asymptotically optimal (finds best x in the limit) as

73 The knowledge gradient
Different belief models
» Lookup tables: independent beliefs; correlated beliefs
» Linear parametric models: linear models; sparse-linear; tree regression
» Nonlinear parametric models: logistic regression; neural networks
» Nonparametric models: Gaussian process regression; kernel regression; support vector machines; deep neural networks

74 The knowledge gradient
The marginal value of information
» Repeatedly sampling the same alternative
[Figure: value of information versus number of times we sample the same alternative]

75-76 The knowledge gradient
The marginal value of information
» The value of information may be concave if an experiment is noisy
[Figure: value of information versus number of times we sample the same alternative]

77 The knowledge gradient
From offline to online learning
» The knowledge gradient computes the value of information for a terminal reward objective:
$$\nu^{KG,n}_x = \mathbb E\left\{ \max_y F\big(y, B^{n+1}(x)\big) \,\Big|\, S^n \right\} - \max_y F(y, B^n)$$
» Imagine that we have a budget of $N$ experiments, and that we are summing rewards over this horizon. The value of information from a single experiment is now
$$\nu^{OLKG,n}_x = \bar\mu^n_x + (N - n)\, \nu^{KG,n}_x$$
where $\bar\mu^n_x$ is the expected reward, $N - n$ is the remaining horizon, and $\nu^{KG,n}_x$ is the offline KG.
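Given the offline knowledge gradient values, the online rule is a single line. The sketch below takes the offline values as an input rather than computing them; the function name and the sample numbers are hypothetical.

```python
def online_kg_choice(mu, kg, n, budget):
    """Pick the alternative maximizing mu_x + (N - n) * KG_x.

    mu     : current estimates of each alternative's expected reward
    kg     : offline knowledge gradient of each alternative
    n      : number of experiments already run
    budget : total number of experiments N
    """
    scores = [m + (budget - n) * v for m, v in zip(mu, kg)]
    return max(range(len(mu)), key=scores.__getitem__)


mu = [1.0, 0.9]
kg = [0.0, 0.05]
early = online_kg_choice(mu, kg, n=0, budget=10)   # long horizon: learning pays
late = online_kg_choice(mu, kg, n=9, budget=10)    # one step left: exploit
```

With the same beliefs the rule explores when many experiments remain and exploits near the end of the budget, since the value of information is multiplied by the remaining horizon.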

78 The knowledge gradient
Knowledge gradient for offline and online learning
Offline learning: $\nu^{KG,n}_x$
Online learning: $\nu^{OLKG,n}_x = \bar\mu^n_x + (N - n)\, \nu^{KG,n}_x$
» This bridges what have historically been fundamentally different fields.

79 Outline: Elements of a sequential decision model; Mixed state problems; Designing policies; Searching for the best policy

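80 Designing policies
Finding the best policy
» We have to first articulate our classes of policies: $f \in \mathcal F = \{\text{PFAs, CFAs, VFAs, DLAs}\}$, and $\theta \in \Theta^f$, the parameters that characterize each family.
» So optimizing over $\pi$ means optimizing over $\Pi = \{ f \in \mathcal F,\ \theta \in \Theta^f \}$
» We then have to pick an objective such as
$$\max_\pi \mathbb E\left\{ \sum_{t=0}^{T} C\big(S_t, X^\pi_t(S_t)\big) \,\Big|\, S_0 \right\} \quad \text{or} \quad \max_\pi \mathbb E\left\{ F(X^{\pi,N}, W) \,\big|\, S_0 \right\}$$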

81 Multiarmed bandit problems
Policy search class
» Policies tend to be relatively simple and easy to compute
» Well suited to rapid (e.g. internet speed) learning applications needing fast computation.
» Tuning is important, and typically requires a realistic simulator.
Lookahead class
» Policies can be relatively complex to compute.
» Well suited to problems with expensive experiments.
» Typically avoids tuning, but may require a prior.

82 Multiarmed bandit problems
Notes:
» Any of the four classes of policies may be appropriate depending on the characteristics of the problem.
» Active learning arises in many applications, but is often overlooked.
» The bandit culture of coming up with problem variations should be inherited by other communities.
» Bandit researchers often focus on good but not optimal policies (e.g. UCB policies) with good characteristics (e.g. robust across a wide range of distributions).

83 MOLTE: Modular, optimal learning testing environment
» Matlab-based environment with modular library of problems and algorithms, each in its own .m file.
» User specifies in a spreadsheet which algorithms are run on which problems
http://

84 MOLTE: Comparison on library problems

85 Princeton ad-click game
In collaboration with Roomsage.com

86 Princeton ad-click game
Learning the bid-response curve [Figure: clicks versus bid ($/click)]
» Varies by hour of week
» Response depends on location, age, gender, device

87 Princeton ad-click game
The ad-click game:
» Learn the best policy for bidding for ads
» Bids compete in a simulated auction following the rules used by Google

88 Thank you! For more information, please visit: http:// See Courses or the jungle webpages.
