Online Convex Optimization Example And Follow-The-Leader

CSE599s, Spring 2014, Online Learning        Lecture 2 - 04/03/2014

Online Convex Optimization Example And Follow-The-Leader

Lecturer: Brendan McMahan    Scribe: Stephen Joe Jonany

1 Review of Online Convex Optimization

We first review the online convex optimization (OCO) problem, before exploring a real-world example.

Online Convex Optimization
  for t = 1, 2, ..., T:
    player chooses w_t
    adversary chooses f_t (simultaneously with the player's decision)
    player's loss f_t(w_t) is computed
    player observes f_t

2 Real World Example - Google Search

2.1 Overview

In class we explored a real-world setting:
- A user typed in the query "The joy of cooking."
- Google's ad servers responded with a list of ads to show to the user.
- The user may possibly click one of the ads.

We might then ask: how does Google decide which ads to show to a user? To identify the online learning aspect of this setting, we first look at what the ad server does:

1. Matches ads on keywords. For instance, the keyword "cooking" might be extracted from the user's query and matched with ads related to cooking.
2. For each ad, computes features x ∈ R^n, then queries a prediction model to estimate the probability of the ad being clicked. Note that this means we match and predict on many more ads than we actually show.
3. Runs an auction, which considers the click probabilities computed in the previous step, as well as the monetary value of each ad.
4. Chooses which ads to show.

We focus on the learning problem of predicting the probability that a particular user clicks a given ad.

2.2 Example Model

In class, we assumed a linear model w ∈ R^n. Given the weights, we might make a prediction like so: p̂ = w · x. However, since we are trying to predict probabilities, we want to normalize the predictions to lie between 0 and 1. That is, we need a function f : R → [0, 1]. We also want this function to be increasing, so that a higher w · x corresponds to a higher probability. A common function that achieves this is the sigmoid, σ(z) = 1 / (1 + e^{-z}). Thus, we predict p̂ = σ(w · x). Note that this is exactly how a logistic regression model computes an input example's class probabilities.

To learn the model, we have a training system that takes in (x, y) pairs, where x is the feature vector of the ad and y is the corresponding label, indicating whether or not the user clicked the ad. Sticking with the logistic regression model, if we use online gradient descent to optimize the loss function, the training process looks like so. We use y = 1 to denote a clicked ad and y = 0 otherwise (to make the math work out when using online gradient descent for logistic regression).

Training Process - SGD for Logistic Regression
  for each (x, y):
    p̂ = σ(w_t · x)
    g = (p̂ - y) x
    w_{t+1} = w_t - η g

Note that in the real setting, we have to keep updating the w used by the ad server with the new weight vectors produced by the training process.

2.3 Problems and Challenges

In the real setting, n, the dimension of the feature vector, can be on the order of billions. In addition, the online setting raises some challenges:

1. Just writing down the model w consumes a great deal of space. Moreover, we want to search for the best such model.
2. Computing the prediction naively, with a dense dot product, would be too slow, especially given the constraint that we want to respond to the user within hundreds of milliseconds. Fortunately, even though w is dense, the feature vector x is usually sparse. This is especially true with a bag-of-words representation. Computing w · x then takes a number of operations proportional to the number of non-zero elements of x, rather than to n.
3. Even if we perform well on the (x, y) examples, we have to remember that these examples come only from ads that were actually shown. There might be unshown ads that are actually good. A future topic related to this is exploration.

2.4 Mapping back to OCO

Going back to the OCO formulation, we see that w_t is the weight vector output by the training process. Since we chose logistic regression as our model, we want our loss function to be related to the conditional likelihood. The specific form chosen in class was f_t(w) = log(1 + exp(-y_t (w · x_t))), with labels encoded here as y_t ∈ {-1, +1}, the usual convention for this form of the logistic loss. It was commented in class that this loss function can be viewed as a soft version of the hinge loss. [A plot comparing the logistic loss to the hinge loss appeared here in the original notes.]
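The training loop and the sparse dot product described above can be sketched in Python. This is an illustrative reconstruction only, not Google's actual system: the feature encoding (a dict of index -> value), the learning rate, and the toy data stream are all made up for the example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function sigma(z) = 1 / (1 + e^{-z}).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sparse_dot(w, x):
    # x is a sparse vector {feature_index: value}; cost is O(nnz(x)), not O(n).
    return sum(w.get(i, 0.0) * v for i, v in x.items())

def sgd_step(w, x, y, eta=0.1):
    """One online gradient-descent step for logistic regression.
    w: dict of weights, x: sparse features, y in {0, 1} (1 = clicked)."""
    p_hat = sigmoid(sparse_dot(w, x))
    # Gradient of the log loss with respect to w is (p_hat - y) * x,
    # so only the non-zero coordinates of x need updating.
    for i, v in x.items():
        w[i] = w.get(i, 0.0) - eta * (p_hat - y) * v
    return w

# Toy stream of (features, clicked) pairs -- invented for illustration.
stream = [({0: 1.0, 3: 1.0}, 1), ({1: 1.0}, 0), ({0: 1.0}, 1)]
w = {}
for x, y in stream:
    w = sgd_step(w, x, y)
```

Because both the prediction and the update touch only the non-zero entries of x, each step costs O(nnz(x)) even when n is in the billions.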
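Since the plot did not survive extraction, a small numeric comparison (not from the original notes) illustrates the "soft hinge" claim: writing z = y (w · x) for the margin, both losses grow roughly linearly for large negative z and vanish for large positive z, but the logistic loss is smooth everywhere.

```python
import math

def logistic_loss(z):
    # Stable evaluation of log(1 + exp(-z)), where z = y * (w . x).
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))

def hinge_loss(z):
    return max(0.0, 1.0 - z)

# The logistic loss tracks the hinge loss: linear growth for large
# negative margins, decay toward zero for large positive margins.
for z in [-3, -1, 0, 1, 3]:
    print(f"z={z:+d}  logistic={logistic_loss(z):.3f}  hinge={hinge_loss(z):.3f}")
```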

However, there were also comments that OCO does not model some aspects of the example. For one, there is a delay before the player actually observes f_t, since this information only becomes available after the user has made a clicking decision.

2.5 Comments on using Regret Analysis

To review,

  Regret = Σ_{t=1}^T f_t(w_t) - min_{w ∈ W} Σ_{t=1}^T f_t(w).

When comparing two models A and B over the same comparator set W, the constant term on the right cancels:

  Regret(A) - Regret(B) = Σ_t f_t(A_t) - Σ_t f_t(B_t) = Loss(A) - Loss(B).

When using regret as a metric to optimize, we have to be wary of whether W changed when we changed the model; a higher regret might mislead us into thinking one model is worse than the other, but this is not necessarily true.

3 Follow-The-Leader (FTL) Algorithm

The first new algorithm we explore in the OCO framework is Follow-The-Leader, which was described as the simplest, most natural, and impractical online learning algorithm. Much of this section is adapted from the survey of Shai Shalev-Shwartz [1].

Follow-The-Leader
  w_1 is set arbitrarily
  for t = 1, 2, ..., T:
    w_t = argmin_{w ∈ W} Σ_{s=1}^{t-1} f_s(w)

There are two ways to view the algorithm:
1. On round t we play the weight vector that would minimize the regret if the game had stopped at round t - 1.
2. We treat all the examples seen so far as a batch machine learning problem, and solve for the best weight vector.
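FTL can be sketched in a few lines when the per-round argmin has a closed form. The example below is ours, not from the lecture: it uses quadratic losses f_t(w) = (w - z_t)^2 on the real line, for which the leader at round t is simply the mean of z_1, ..., z_{t-1}.

```python
def ftl_play(zs):
    """Follow-The-Leader for losses f_t(w) = (w - z_t)**2 on the real line.
    The leader argmin_w sum_{s<t} (w - z_s)**2 is the mean of z_1..z_{t-1}."""
    plays = []
    w = 0.0           # w_1 set arbitrarily
    total = 0.0
    for t, z in enumerate(zs, start=1):
        plays.append(w)       # play the current leader
        total += z
        w = total / t         # leader after observing f_1..f_t
    return plays

print(ftl_play([1.0, 3.0, 2.0, 2.0]))  # [0.0, 1.0, 2.0, 2.0]
```

Note how each round's play is the batch solution over everything seen so far, which is exactly view 2 above; with a closed-form leader this is cheap, but in general the batch problem must be re-solved every round.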

We note that the longer we run this algorithm, the slower it gets, since the batch problem keeps growing (hence "impractical"). We now prove our first regret bound. Note that regret can also be expressed relative to a chosen vector, instead of being compared against the best model in hindsight. In the lemma below, Regret(u) denotes the regret with respect to the vector u.

Lemma 1. The points w_1, ..., w_T played by FTL satisfy, for all u ∈ W,

  Regret(u) ≤ Σ_{t=1}^T ( f_t(w_t) - f_t(w_{t+1}) ).

Proof. We first restate the lemma. Substituting the definition of Regret(u) gives the equivalent statement

  Σ_{t=1}^T ( f_t(w_t) - f_t(u) ) ≤ Σ_{t=1}^T ( f_t(w_t) - f_t(w_{t+1}) ),

which reduces to

  Σ_{t=1}^T f_t(w_{t+1}) ≤ Σ_{t=1}^T f_t(u).

In class, the expression on the left side of this inequality was called the cumulative loss of the Be-The-Leader (BTL) algorithm, which cheats by looking ahead at the next example. This term was introduced by Kalai et al. [2]. We now prove the statement by induction on T.

Base case (T = 1). f_1(w_2) ≤ f_1(u) for any u, because w_2 minimizes f_1: recall that w_2 = argmin_w Σ_{s=1}^{1} f_s(w) = argmin_w f_1(w).

Inductive hypothesis. Assume the statement holds at time T - 1, that is, for all u ∈ W,

  Σ_{t=1}^{T-1} f_t(w_{t+1}) ≤ Σ_{t=1}^{T-1} f_t(u).

Inductive step. Starting from the inductive hypothesis, the following inequalities are equivalent:

  Σ_{t=1}^{T-1} f_t(w_{t+1}) ≤ Σ_{t=1}^{T-1} f_t(u)
  Σ_{t=1}^{T-1} f_t(w_{t+1}) + f_T(w_{T+1}) ≤ Σ_{t=1}^{T-1} f_t(u) + f_T(w_{T+1})
  Σ_{t=1}^{T} f_t(w_{t+1}) ≤ Σ_{t=1}^{T-1} f_t(u) + f_T(w_{T+1}).

Since the last statement holds for all u, we can take u = w_{T+1}, which gives

  Σ_{t=1}^{T} f_t(w_{t+1}) ≤ Σ_{t=1}^{T} f_t(w_{T+1}).

View the right side as a function of a vector, f(v) = Σ_{t=1}^T f_t(v), evaluated at v = w_{T+1}. But w_{T+1} is exactly the minimizer of f(v). Thus, for all u, Σ_{t=1}^T f_t(w_{T+1}) ≤ Σ_{t=1}^T f_t(u). With this final inequality, the induction is complete. ∎
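Lemma 1 can be sanity-checked numerically. The sketch below (our illustration, not part of the lecture) again uses quadratic losses f_t(w) = (w - z_t)^2, where the FTL iterates have a closed form, and verifies Regret(u) ≤ Σ_t ( f_t(w_t) - f_t(w_{t+1}) ) for several comparators u.

```python
import random

def check_lemma1(zs, us):
    """Verify Regret(u) <= sum_t (f_t(w_t) - f_t(w_{t+1})) for FTL with
    quadratic losses f_t(w) = (w - z_t)**2 (a convenient convex family)."""
    f = lambda w, z: (w - z) ** 2
    # FTL plays: w_1 = 0, then running means; include w_{T+1} for the bound.
    plays = [0.0]
    total = 0.0
    for t, z in enumerate(zs, start=1):
        total += z
        plays.append(total / t)      # plays[t] is w_{t+1}
    T = len(zs)
    bound = sum(f(plays[t], zs[t]) - f(plays[t + 1], zs[t]) for t in range(T))
    for u in us:
        regret_u = sum(f(plays[t], zs[t]) - f(u, zs[t]) for t in range(T))
        assert regret_u <= bound + 1e-9, (regret_u, bound)
    return bound

random.seed(0)
zs = [random.uniform(-1, 1) for _ in range(50)]
check_lemma1(zs, us=[-1.0, -0.5, 0.0, 0.5, 1.0])
```

The bound is never violated here, consistent with the proof; the interesting question, taken up next, is how loose it can be.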

We now explore an example where FTL performs poorly.

Example 1. Assume W = [-1, 1] and f_t(w) = g_t w, where g_t ∈ [-1, 1] is chosen by the adversary and the player chooses w_t. We abbreviate Σ_{s=1}^t g_s as g_{1:t}. Below we show a game between a player using the FTL strategy, which arbitrarily picks w_1 = 0, and an adversary that wants the player to incur losses (note that the adversary also knows that the FTL strategy is deterministic). We also show the BTL strategy for comparison. The points FTL plays are given by

  w_t = argmin_{w ∈ [-1,1]} Σ_{s=1}^{t-1} f_s(w) = argmin_{w ∈ [-1,1]} g_{1:t-1} w.

So the FTL strategy always chooses the sign opposite to that of g_{1:t-1}, with magnitude 1.

  t | FTL w_t | g_t  | FTL loss | g_{1:t} | BTL ŵ_t = w_{t+1} | BTL loss
  1 |    0    |  0.5 |    0     |   0.5   |        -1         |  -0.5
  2 |   -1    | -1   |    1     |  -0.5   |         1         |  -1
  3 |    1    |  1   |    1     |   0.5   |        -1         |  -1
  4 |   -1    | -1   |    1     |  -0.5   |         1         |  -1
  ...

The regret of FTL is (T - 1) - (-0.5) ≈ T, which is bad, because we want the regret to be o(T). Noting that the cumulative loss of BTL is roughly -T, the bound we just proved gives, for all u ∈ W, Regret(u) ≤ (T - 0.5) - (-T + 0.5) = 2T - 1. This upper bound exceeds T, which makes the regret bound a weak one here.

Towards the end of the lecture, it was hinted that the Follow-The-Regularized-Leader algorithm improves the stability of FTL by adding a regularizing term. But that's for the next lecture!

References

[1] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 2012.
[2] Adam Kalai and Santosh Vempala. Efficient Algorithms for Online Decision Problems. Journal of Computer and System Sciences, 2005.