Online Convex Optimization Example And Follow-The-Leader

CSE599s, Spring 2014, Online Learning        Lecture 2 - 04/03/2014

Online Convex Optimization Example And Follow-The-Leader

Lecturer: Brendan McMahan    Scribe: Stephen Joe Jonany

1 Review of Online Convex Optimization

We first review the online convex optimization (OCO) problem, before exploring a real-world example.

Online Convex Optimization
  for t = 1, 2, ..., T:
    player chooses w_t
    adversary chooses f_t (simultaneously with the player's decision)
    player's loss f_t(w_t) is computed
    player observes f_t

2 Real World Example - Google Search

2.1 Overview

In class we explored a real-world setting:
- A user typed in the query "The joy of cooking."
- Google's ad servers responded with a list of ads to show to the user.
- The user may possibly click one of the ads.

We might then ask: how does Google decide which ads to show to a user? To identify the online learning aspect of this setting, we first look at what the ad server does:

1. Matches ads on keywords. For instance, the keyword "cooking" might be extracted from the user's query and matched with ads related to cooking.
2. For each ad, computes features x ∈ R^n, then queries a prediction model to estimate the probability of the ad being clicked. Note that this means we match and predict on many more ads than we actually show.
3. Runs an auction, which considers the click probabilities computed in the previous step, as well as the monetary value of each ad.
4. Chooses which ads to show.

We focus on the learning problem of predicting the probability that a particular user clicks a given ad.

2.2 Example Model

In class, we assumed a linear model w ∈ R^n. Given the weights, we might make a prediction like so: p̂ = w · x. However, since we are trying to predict probabilities, we want to normalize the predictions to lie between 0 and 1. That is, we need a function f : R → [0, 1]. We also want this function to be increasing, so that a higher w · x corresponds to a higher probability. A common function that achieves this is the sigmoid, σ(z) = 1 / (1 + e^{-z}). Thus, we predict p̂ = σ(w · x). Note that this is exactly how a logistic regression model computes an input example's class probabilities.

To learn the model, we have a training system that takes in (x, y) pairs, where x is the feature vector of the ad and y is the corresponding label, indicating whether or not the user clicked the ad. Sticking with the logistic regression model, if we use online gradient descent to optimize the loss function, the training process looks like so. We use y = 1 to denote a clicked ad and y = 0 otherwise (to make the math work out when using online gradient descent for logistic regression).

Training Process - SGD for Logistic Regression
  for each (x, y):
    p̂ = σ(w_t · x)
    g = (p̂ - y) x
    w_{t+1} = w_t - η g

Note that in the real setting, we have to keep updating the w used by the ad server with the new weight vectors produced by the training process.

2.3 Problems and Challenges

In the real setting, n, the dimension of the feature vector, can be on the order of billions. In addition, the online setting raises some challenges:

1. Just writing down the model w consumes a great deal of space. Moreover, we want to search for the best such model.
2. Computing the prediction naively, with a dense dot product, would be too slow, especially given the constraint that we want to respond to the user within hundreds of milliseconds. Fortunately, even though w is dense, the feature vector x is usually sparse. This is especially true with a bag-of-words representation. Computing w · x then takes a number of operations proportional to the number of non-zero elements of x, rather than to n.
3. Even if we perform well on the (x, y) examples, we have to remember that these examples come only from ads that were actually shown. There might be unshown ads that are actually good. A future topic related to this is exploration.

2.4 Mapping back to OCO

Going back to the OCO formulation, we see that w_t is the weight vector output by the training process. Since we chose logistic regression as our model, we want our loss function to be related to the conditional likelihood. The specific form chosen in class was f_t(w) = log(1 + exp(-y_t (w · x_t))), with labels encoded here as y_t ∈ {-1, +1}, the usual convention for this form of the logistic loss. It was commented in class that this loss function can be viewed as a soft version of the hinge loss. [A plot comparing the logistic loss to the hinge loss appeared here in the original notes.]
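The training loop and the sparse dot product described above can be sketched in Python. This is an illustrative reconstruction only, not Google's actual system: the feature encoding (a dict of index -> value), the learning rate, and the toy data stream are all made up for the example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function sigma(z) = 1 / (1 + e^{-z}).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sparse_dot(w, x):
    # x is a sparse vector {feature_index: value}; cost is O(nnz(x)), not O(n).
    return sum(w.get(i, 0.0) * v for i, v in x.items())

def sgd_step(w, x, y, eta=0.1):
    """One online gradient-descent step for logistic regression.
    w: dict of weights, x: sparse features, y in {0, 1} (1 = clicked)."""
    p_hat = sigmoid(sparse_dot(w, x))
    # Gradient of the log loss with respect to w is (p_hat - y) * x,
    # so only the non-zero coordinates of x need updating.
    for i, v in x.items():
        w[i] = w.get(i, 0.0) - eta * (p_hat - y) * v
    return w

# Toy stream of (features, clicked) pairs -- invented for illustration.
stream = [({0: 1.0, 3: 1.0}, 1), ({1: 1.0}, 0), ({0: 1.0}, 1)]
w = {}
for x, y in stream:
    w = sgd_step(w, x, y)
```

Because both the prediction and the update touch only the non-zero entries of x, each step costs O(nnz(x)) even when n is in the billions.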
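Since the plot did not survive extraction, a small numeric comparison (not from the original notes) illustrates the "soft hinge" claim: writing z = y (w · x) for the margin, both losses grow roughly linearly for large negative z and vanish for large positive z, but the logistic loss is smooth everywhere.

```python
import math

def logistic_loss(z):
    # Stable evaluation of log(1 + exp(-z)), where z = y * (w . x).
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))

def hinge_loss(z):
    return max(0.0, 1.0 - z)

# The logistic loss tracks the hinge loss: linear growth for large
# negative margins, decay toward zero for large positive margins.
for z in [-3, -1, 0, 1, 3]:
    print(f"z={z:+d}  logistic={logistic_loss(z):.3f}  hinge={hinge_loss(z):.3f}")
```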

However, there were also comments that OCO does not model some aspects of the example. For one, there is a delay before the player actually observes f_t, since this information only becomes available after the user has made a clicking decision.

2.5 Comments on using Regret Analysis

To review,

  Regret = Σ_{t=1}^T f_t(w_t) - min_{w ∈ W} Σ_{t=1}^T f_t(w).

When comparing two models A and B over the same comparator set W, the constant term on the right cancels:

  Regret(A) - Regret(B) = Σ_t f_t(A_t) - Σ_t f_t(B_t) = Loss(A) - Loss(B).

When using regret as a metric to optimize, we have to be wary of whether W changed when we changed the model; a higher regret might mislead us into thinking one model is worse than the other, but this is not necessarily true.

3 Follow-The-Leader (FTL) Algorithm

The first new algorithm we explore in the OCO framework is Follow-The-Leader, which was described as the simplest, most natural, and impractical online learning algorithm. Much of this section is adapted from the survey of Shai Shalev-Shwartz [1].

Follow-The-Leader
  w_1 is set arbitrarily
  for t = 1, 2, ..., T:
    w_t = argmin_{w ∈ W} Σ_{s=1}^{t-1} f_s(w)

There are two ways to view the algorithm:
1. On round t we play the weight vector that would minimize the regret if the game had stopped at round t - 1.
2. We treat all the examples seen so far as a batch machine learning problem, and solve for the best weight vector.
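FTL can be sketched in a few lines when the per-round argmin has a closed form. The example below is ours, not from the lecture: it uses quadratic losses f_t(w) = (w - z_t)^2 on the real line, for which the leader at round t is simply the mean of z_1, ..., z_{t-1}.

```python
def ftl_play(zs):
    """Follow-The-Leader for losses f_t(w) = (w - z_t)**2 on the real line.
    The leader argmin_w sum_{s<t} (w - z_s)**2 is the mean of z_1..z_{t-1}."""
    plays = []
    w = 0.0           # w_1 set arbitrarily
    total = 0.0
    for t, z in enumerate(zs, start=1):
        plays.append(w)       # play the current leader
        total += z
        w = total / t         # leader after observing f_1..f_t
    return plays

print(ftl_play([1.0, 3.0, 2.0, 2.0]))  # [0.0, 1.0, 2.0, 2.0]
```

Note how each round's play is the batch solution over everything seen so far, which is exactly view 2 above; with a closed-form leader this is cheap, but in general the batch problem must be re-solved every round.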

We note that the longer we run this algorithm, the slower it gets, since the batch problem keeps growing (hence "impractical"). We now prove our first regret bound. Note that regret can also be expressed relative to a chosen vector, instead of being compared against the best model in hindsight. In the lemma below, Regret(u) denotes the regret with respect to the vector u.

Lemma 1. The points w_1, ..., w_T played by FTL satisfy, for all u ∈ W,

  Regret(u) ≤ Σ_{t=1}^T ( f_t(w_t) - f_t(w_{t+1}) ).

Proof. We first restate the lemma. Substituting the definition of Regret(u) gives the equivalent statement

  Σ_{t=1}^T ( f_t(w_t) - f_t(u) ) ≤ Σ_{t=1}^T ( f_t(w_t) - f_t(w_{t+1}) ),

which reduces to

  Σ_{t=1}^T f_t(w_{t+1}) ≤ Σ_{t=1}^T f_t(u).

In class, the expression on the left side of this inequality was called the cumulative loss of the Be-The-Leader (BTL) algorithm, which cheats by looking ahead at the next example. This term was introduced by Kalai et al. [2]. We now prove the statement by induction on T.

Base case (T = 1). f_1(w_2) ≤ f_1(u) for any u, because w_2 minimizes f_1: recall that w_2 = argmin_w Σ_{s=1}^{1} f_s(w) = argmin_w f_1(w).

Inductive hypothesis. Assume the statement holds at time T - 1, that is, for all u ∈ W,

  Σ_{t=1}^{T-1} f_t(w_{t+1}) ≤ Σ_{t=1}^{T-1} f_t(u).

Inductive step. Starting from the inductive hypothesis, the following inequalities are equivalent:

  Σ_{t=1}^{T-1} f_t(w_{t+1}) ≤ Σ_{t=1}^{T-1} f_t(u)
  Σ_{t=1}^{T-1} f_t(w_{t+1}) + f_T(w_{T+1}) ≤ Σ_{t=1}^{T-1} f_t(u) + f_T(w_{T+1})
  Σ_{t=1}^{T} f_t(w_{t+1}) ≤ Σ_{t=1}^{T-1} f_t(u) + f_T(w_{T+1}).

Since the last statement holds for all u, we can take u = w_{T+1}, which gives

  Σ_{t=1}^{T} f_t(w_{t+1}) ≤ Σ_{t=1}^{T} f_t(w_{T+1}).

View the right side as a function of a vector, f(v) = Σ_{t=1}^T f_t(v), evaluated at v = w_{T+1}. But w_{T+1} is exactly the minimizer of f(v). Thus, for all u, Σ_{t=1}^T f_t(w_{T+1}) ≤ Σ_{t=1}^T f_t(u). With this final inequality, the induction is complete. ∎
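Lemma 1 can be sanity-checked numerically. The sketch below (our illustration, not part of the lecture) again uses quadratic losses f_t(w) = (w - z_t)^2, where the FTL iterates have a closed form, and verifies Regret(u) ≤ Σ_t ( f_t(w_t) - f_t(w_{t+1}) ) for several comparators u.

```python
import random

def check_lemma1(zs, us):
    """Verify Regret(u) <= sum_t (f_t(w_t) - f_t(w_{t+1})) for FTL with
    quadratic losses f_t(w) = (w - z_t)**2 (a convenient convex family)."""
    f = lambda w, z: (w - z) ** 2
    # FTL plays: w_1 = 0, then running means; include w_{T+1} for the bound.
    plays = [0.0]
    total = 0.0
    for t, z in enumerate(zs, start=1):
        total += z
        plays.append(total / t)      # plays[t] is w_{t+1}
    T = len(zs)
    bound = sum(f(plays[t], zs[t]) - f(plays[t + 1], zs[t]) for t in range(T))
    for u in us:
        regret_u = sum(f(plays[t], zs[t]) - f(u, zs[t]) for t in range(T))
        assert regret_u <= bound + 1e-9, (regret_u, bound)
    return bound

random.seed(0)
zs = [random.uniform(-1, 1) for _ in range(50)]
check_lemma1(zs, us=[-1.0, -0.5, 0.0, 0.5, 1.0])
```

The bound is never violated here, consistent with the proof; the interesting question, taken up next, is how loose it can be.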

We now explore an example where FTL performs poorly.

Example 1. Assume W = [-1, 1] and f_t(w) = g_t w, where g_t ∈ [-1, 1] is chosen by the adversary and the player chooses w_t. We abbreviate Σ_{s=1}^t g_s as g_{1:t}. Below we show a game between a player using the FTL strategy, which arbitrarily picks w_1 = 0, and an adversary that wants the player to incur losses (note that the adversary also knows that the FTL strategy is deterministic). We also show the BTL strategy for comparison. The points FTL plays are given by

  w_t = argmin_{w ∈ [-1,1]} Σ_{s=1}^{t-1} f_s(w) = argmin_{w ∈ [-1,1]} g_{1:t-1} w.

So the FTL strategy always chooses the sign opposite to that of g_{1:t-1}, with magnitude 1.

  t | FTL w_t | g_t  | FTL loss | g_{1:t} | BTL ŵ_t = w_{t+1} | BTL loss
  1 |    0    |  0.5 |    0     |   0.5   |        -1         |  -0.5
  2 |   -1    | -1   |    1     |  -0.5   |         1         |  -1
  3 |    1    |  1   |    1     |   0.5   |        -1         |  -1
  4 |   -1    | -1   |    1     |  -0.5   |         1         |  -1
  ...

The regret of FTL is (T - 1) - (-0.5) ≈ T, which is bad, because we want the regret to be o(T). Noting that the cumulative loss of BTL is roughly -T, the bound we just proved gives, for all u ∈ W, Regret(u) ≤ (T - 0.5) - (-T + 0.5) = 2T - 1. This upper bound exceeds T, which makes the regret bound a weak one here.

Towards the end of the lecture, it was hinted that the Follow-The-Regularized-Leader algorithm improves the stability of FTL by adding a regularizing term. But that's for the next lecture!

References

[1] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 2012.
[2] Adam Kalai and Santosh Vempala. Efficient Algorithms for Online Decision Problems. Journal of Computer and System Sciences, 2005.