CS 294-128: Algorithms and Uncertainty Lecture 14 Date: October 17, 2016


CS 294-128: Algorithms and Uncertainty Lecture 14 Date: October 17, 2016 Instructor: Nikhil Bansal Scribe: Antares Chen

1 Introduction

In this lecture, we review results regarding follow the regularized leader (FTRL). We then begin to discuss a new online convex optimization algorithm known as mirror descent. First, we build the intuition behind the algorithm by introducing the Bregman divergence. We then discuss the mechanics of the mirror descent algorithm, show a remarkable equivalence with FTRL, and provide an example application. Finally, we relate online mirror descent to Fenchel duality and provide some intuition for using the Bregman divergence as a distance measure.

2 Review

2.1 Setting

For the past few lectures, we have discussed online convex optimization (OCO). The problem specification is as follows. We are given a decision domain modeled as a convex set $K$ in Euclidean space. At each time step $t$, the player is hit with a convex cost function $f_t : K \to \mathbb{R}$. The player then chooses $x_t$ so as to minimize the regret:

$$\mathrm{regret} = \sum_t f_t(x_t) - \min_{y \in K} \sum_t f_t(y)$$

For the remainder of these notes, we denote $\nabla_t = \nabla f_t(x_t)$ and assume all cost functions are linear. Our regret analysis will also depend on the notion of diameter, which we now define.

Definition 1 The diameter of $K$ with respect to $R$ is the quantity $D_R$ given by

$$D_R^2 = \max_{x, y \in K} \{ R(x) - R(y) \}$$

2.2 Follow the regularized leader

Previously, we discussed an online convex optimization algorithm known as follow the regularized leader (FTRL), which was introduced in [5][6]. The analysis of online mirror descent will rely heavily on this algorithm, so we review it here. At the $t$-th time step, the next point $x_{t+1}$ is chosen by the update rule

$$x_{t+1} = \operatorname{argmin}_{x \in K} \left\{ \eta (\nabla_1 + \cdots + \nabla_t) \cdot x + R(x) \right\}$$

Here, $R(x)$ is a regularization function, often chosen to be $\alpha$-strongly convex with respect to some norm. An analysis based on the be-the-leader (BTL) regime [2] yields the following regret bound.
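Before stating the bound, it may help to see the FTRL update run on a toy instance. The sketch below is ours, not from the notes: it assumes linear costs, the unconstrained domain $K = \mathbb{R}^n$ (so no projection is needed), and the quadratic regularizer $R(x) = \|x\|^2/2$, for which the argmin has the closed form $x_{t+1} = -\eta(\nabla_1 + \cdots + \nabla_t)$. All function names are illustrative.

```python
# FTRL with linear costs f_t(x) = grad_t . x, decision set K = R^n, and the
# quadratic regularizer R(x) = ||x||^2 / 2.  With this R the update
#   x_{t+1} = argmin_x { eta * (grad_1 + ... + grad_t) . x + R(x) }
# has the closed form x_{t+1} = -eta * (grad_1 + ... + grad_t).

def ftrl_quadratic(gradients, eta):
    """Return the iterates x_1, ..., x_{T+1} of FTRL under R(x) = ||x||^2 / 2."""
    n = len(gradients[0])
    cumulative = [0.0] * n
    iterates = [[0.0] * n]                 # x_1 minimizes R alone, so x_1 = 0
    for g in gradients:
        cumulative = [c + gi for c, gi in zip(cumulative, g)]
        iterates.append([-eta * c for c in cumulative])
    return iterates

grads = [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]]
xs = ftrl_quadratic(grads, eta=0.1)
# x_2 = -0.1 * (1, -1) = (-0.1, 0.1); x_4 = -0.1 * (0.5, -0.5) = (-0.05, 0.05)
```

Note how each iterate depends only on the running gradient sum; the regularizer merely shrinks it toward $x_1$.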

Theorem 2 Let $\|\cdot\|_*$ denote the dual norm of $\|\cdot\|$. If $R(x)$ is $\alpha$-strongly convex with respect to $\|\cdot\|$, then for any comparator $y \in K$ the regret of FTRL is bounded as follows:

$$\mathrm{regret} \le \frac{2\eta}{\alpha} \sum_t \|\nabla_t\|_*^2 + \frac{R(y) - R(x_1)}{\eta}$$

3 Online Mirror Descent

We now introduce online mirror descent (OMD), an online variant of Nemirovski and Yudin's mirror descent algorithm [4]. First discussed by [7], OMD is very similar to online gradient descent, as the algorithm computes the current decision iteratively from a gradient update rule and the previous decision. However, the power of OMD lies in the update being carried out in a dual space, defined by our choice of regularizer. This follows from viewing $\nabla R$ as a mapping from $\mathbb{R}^n$ onto itself; by carrying out the update in this space, we take advantage of a rich geometry that exists only in the dual. Indeed, this has led to discoveries showing many algorithms to be special cases of online mirror descent [3][9]. More recently, it has been shown that online mirror descent not only applies to a general class of online convex optimization problems, but does so with optimal regret bounds [8].

3.1 The algorithm

Online mirror descent relies on the Bregman divergence.

Definition 3 The Bregman divergence between $x$ and $y$ with respect to the function $R$ is

$$B_R(x \| y) = R(x) - R(y) - \nabla R(y) \cdot (x - y)$$

This immediately gives the notion of the Bregman projection of $y$ onto a convex set $K$:

$$\operatorname{argmin}_{x \in K} B_R(x \| y)$$

We are now ready to discuss online mirror descent. The algorithm takes as input a learning rate $\eta > 0$ and a regularization function $R(x)$.
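Definition 3 is easy to check numerically. The sketch below is ours (names are illustrative): it computes $B_R$ for two standard regularizers, the squared Euclidean norm, whose Bregman divergence is half the squared distance, and negative entropy, whose divergence between points on the simplex is the KL divergence (proved later as Theorem 6).

```python
import math

def bregman(R, grad_R, x, y):
    """B_R(x || y) = R(x) - R(y) - grad_R(y) . (x - y)  (Definition 3)."""
    return R(x) - R(y) - sum(g * (xi - yi) for g, xi, yi in zip(grad_R(y), x, y))

# R(x) = ||x||^2 / 2  gives  B_R(x || y) = ||x - y||^2 / 2.
sq = lambda x: 0.5 * sum(v * v for v in x)
sq_grad = lambda x: list(x)

# R(x) = sum_i x(i) log x(i)  (negative entropy) gives, on the simplex,
# B_R(x || y) = KL(x || y).
neg_ent = lambda x: sum(v * math.log(v) for v in x)
neg_ent_grad = lambda x: [math.log(v) + 1.0 for v in x]

x, y = [0.2, 0.8], [0.5, 0.5]
b_sq = bregman(sq, sq_grad, x, y)       # 0.5 * (0.3^2 + 0.3^2) = 0.09
b_kl = bregman(neg_ent, neg_ent_grad, x, y)
kl = sum(p * math.log(p / q) for p, q in zip(x, y))   # agrees with b_kl
```

Note that $B_R$ is generally asymmetric in $x$ and $y$, which is why it is only a distance-like measure rather than a metric.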

The pseudocode is provided below.

Algorithm 1 Online mirror descent
1: Initialize $y_1$ such that $\nabla R(y_1) = 0$, and set $x_1 = \operatorname{argmin}_{x \in K} B_R(x \| y_1)$
2: for $t = 1$ to $T$ do
3:   Play $x_t$ and receive cost function $f_t$
4:   Update $y_{t+1}$ according to the rule $\nabla R(y_{t+1}) = \nabla R(y_t) - \eta \nabla_t$
5:   Bregman project back to $K$: $x_{t+1} = \operatorname{argmin}_{x \in K} B_R(x \| y_{t+1})$
6: end for

In terms of implementation, $y_{t+1}$ may be recovered by applying the inverse gradient mapping $(\nabla R)^{-1}$. If $R$ is $\alpha$-strongly convex and differentiable, then $\nabla R$ is injective, so this inverse mapping is well defined.

3.2 Regret analysis

Hazan and Kale [1] provided an extraordinary result equating FTRL with OMD. This theorem, which we now prove, will later allow us to bootstrap Theorem 2 into regret bounds for online mirror descent.

Theorem 4 If $R$ is $\alpha$-strongly convex, then the lazy OMD and FTRL algorithms produce equivalent predictions:

$$\operatorname{argmin}_{x \in K} B_R(x \| y_{t+1}) = \operatorname{argmin}_{x \in K} \left\{ \eta \sum_{s=1}^t \nabla_s \cdot x + R(x) \right\}$$

Proof: Observe that in lazy OMD, $y_{t+1}$ is determined by the constraint $\nabla R(y_{t+1}) = \nabla R(y_t) - \eta \nabla_t$. Unrolling this recurrence and using $\nabla R(y_1) = 0$ gives

$$y_{t+1} = (\nabla R)^{-1}\big( \nabla R(y_t) - \eta \nabla_t \big) = (\nabla R)^{-1}\big( \nabla R(y_{t-1}) - \eta \nabla_{t-1} - \eta \nabla_t \big) = \cdots = (\nabla R)^{-1}\Big( -\eta \sum_{s=1}^t \nabla_s \Big)$$

First consider the case where $y_{t+1} \in K$, so that the projection leaves it unchanged and in the OMD regime we have $x_{t+1} = y_{t+1}$. Denote the FTRL objective by $\Phi_t(x) = \eta \sum_{s=1}^t \nabla_s \cdot x + R(x)$. Taking the gradient gives

$$\nabla \Phi_t(x) = \eta \sum_{s=1}^t \nabla_s + \nabla R(x)$$

However, since FTRL minimizes this quantity over $x$, at the (interior) minimizer we must have $\nabla \Phi_t = 0$:

$$\nabla R(x) = -\eta \sum_{s=1}^t \nabla_s \quad \Longrightarrow \quad x = (\nabla R)^{-1}\Big( -\eta \sum_{s=1}^t \nabla_s \Big)$$

which is exactly $y_{t+1}$. Now if $y_{t+1} \notin K$, we must Bregman project back onto $K$. The projection is given by definition, but since we minimize with respect to $x$, terms independent of this variable can be eliminated, giving

$$\operatorname{argmin}_{x \in K} B_R(x \| y_{t+1}) = \operatorname{argmin}_{x \in K} \left\{ R(x) - R(y_{t+1}) - \nabla R(y_{t+1}) \cdot (x - y_{t+1}) \right\} = \operatorname{argmin}_{x \in K} \left\{ R(x) - \nabla R(y_{t+1}) \cdot x \right\} = \operatorname{argmin}_{x \in K} \left\{ R(x) + \eta \sum_{s=1}^t \nabla_s \cdot x \right\}$$

In all cases, the updates for OMD and FTRL are equivalent. Thus the theorem holds.

4 Experts From Online Mirror Descent

As stated previously, many algorithms occur as special cases of online mirror descent. We now showcase the results of [3]. Recall the setup for experts: at time $t$, a probability distribution $p_t$ is maintained over $n$ experts, and a loss vector $l_t$ is revealed. Our goal is to pick experts so that, over $T$ time steps, our expected loss is close to that of the expert incurring minimal loss.

4.1 Exponentiated gradient algorithm

Let $x(i)$ denote the $i$-th component of $x$, and let our regularization function be the negative entropy function $R(x) = \sum_i x(i) \log x(i)$. We then have $\nabla R(x) = (\log x(i) + 1)_i$. From the OMD algorithm, the update rule for $y_{t+1}$ is then the following:

$$\nabla R(y_{t+1}) = \nabla R(y_t) - \eta \nabla_t$$
$$\log y_{t+1}(i) + 1 = \log y_t(i) + 1 - \eta \nabla_t(i)$$
$$\log y_{t+1}(i) = \log y_t(i) - \eta \nabla_t(i)$$
$$y_{t+1}(i) = y_t(i) \, e^{-\eta \nabla_t(i)}$$

Recall that in the expert setting, our convex set $K$ is simply the $n$-dimensional simplex $\Delta_n = \{ x \in \mathbb{R}^n_{\ge 0} : \sum_i x(i) = 1 \}$. We make two critical observations.

By Theorem 6, the Bregman divergence with respect to the negative entropy function becomes relative entropy. This is also known as the Kullback-Leibler (KL) divergence.

By Theorem 7, the Bregman projection with respect to the negative entropy function becomes scaling by the $\ell_1$-norm.

We have fully defined a special case of the OMD update regime called the exponentiated gradient algorithm.

Algorithm 2 Exponentiated gradient
1: Initialize $y_1 = \mathbf{1}$ and $x_1 = y_1 / \|y_1\|_1$
2: for $t = 1$ to $T$ do
3:   Play $x_t$ and receive cost function $f_t$
4:   Update $y_{t+1}(i) = y_t(i) \, e^{-\eta \nabla_t(i)}$
5:   Bregman project back to $K$: $x_{t+1} = y_{t+1} / \|y_{t+1}\|_1$
6: end for

Previously, we provided a multiplicative weight update method for expert learning and proved regret bounds using a potential function argument. Here, however, the algorithm falls directly out of OMD as a special case!

4.2 Regret analysis

We have demonstrated that OMD is equivalent to FTRL, so we may bootstrap Theorem 2 to bound the regret of exponentiated gradient.

Theorem 5 Suppose all expert costs are bounded: $l_t(i) \in [0, 1]$. Then the regret of the exponentiated gradient algorithm satisfies

$$\mathrm{regret} \le O\big( \sqrt{T \log n} \big)$$

Proof: First, substitute $R(y) - R(x)$ with the diameter. By Theorem 2, we have

$$\mathrm{regret} \le \frac{2\eta}{\alpha} \sum_t \|\nabla_t\|_*^2 + \frac{D_R^2}{\eta}$$

Differentiating with respect to $\eta$ and minimizing the above expression gives

$$\eta = D_R \sqrt{\frac{\alpha}{2 \sum_t \|\nabla_t\|_*^2}} \qquad \mathrm{regret} \le 2 D_R \sqrt{\frac{2}{\alpha} \sum_t \|\nabla_t\|_*^2}$$

Observe that if all expert costs are in the range $[0, 1]$, then the cost gradient must be bounded as $\|\nabla_t\|_\infty = \|l_t\|_\infty \le 1$.

By Pinsker's inequality (Theorem 8), the negative entropy function is strongly convex with respect to the $\ell_1$-norm; specifically, it is $\alpha$-strongly convex with $\alpha = \frac{1}{2 \ln 2}$. The relevant dual norm here is the $\ell_\infty$-norm, since the dual of the $\ell_1$-norm is the $\ell_\infty$-norm, which follows from the generalized Cauchy-Schwarz inequality. Additionally, using Jensen's inequality, one may show that $D_R^2 \le \log n$ on the simplex $\Delta_n$. Our regret is now the following:

$$\mathrm{regret} \le 2 D_R \sqrt{\frac{2}{\alpha} \sum_t \|\nabla_t\|_\infty^2} \le 2 \sqrt{(\log n)(4 \ln 2) T} = 4 \sqrt{(\ln 2) \, T \log n} = O\big( \sqrt{T \log n} \big)$$

This completes our analysis.

References

[1] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In The 21st Annual Conference on Learning Theory (COLT), pages 57-68, 2008.

[2] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.

[3] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-64, 1997.

[4] A. Nemirovski and D. Yudin. On Cesaro's convergence of the gradient descent method for finding saddle points of convex-concave functions. Doklady Akademii Nauk SSSR, 239(4), 1978.

[5] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.

[6] S. Shalev-Shwartz and Y. Singer. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115-142, 2007.

[7] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. Advances in Neural Information Processing Systems, 19:1265, 2007.

[8] N. Srebro, K. Sridharan, and A. Tewari. On the universality of online mirror descent. Advances in Neural Information Processing Systems, pages 2645-2653, 2011.

[9] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
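As a coda to Section 4, the exponentiated gradient algorithm is short enough to run end to end. The sketch below is ours, not from the notes: the losses are synthetic, the learning rate $\eta = \sqrt{\log n / T}$ is an illustrative choice in the spirit of Theorem 5, and we compare the algorithm's total loss with that of the best expert in hindsight.

```python
import math, random

def exponentiated_gradient(losses, eta):
    """Algorithm 2: multiplicative update y_{t+1}(i) = y_t(i) * exp(-eta * l_t(i)),
    followed by scaling by the l1-norm (the Bregman projection onto the simplex)."""
    n = len(losses[0])
    y = [1.0] * n                               # y_1 = all-ones vector
    total = 0.0
    for l in losses:
        s = sum(y)
        x = [w / s for w in y]                  # x_t = y_t / ||y_t||_1
        total += sum(xi * li for xi, li in zip(x, l))
        y = [w * math.exp(-eta * li) for w, li in zip(y, l)]
    return total

random.seed(0)
T, n = 200, 4
losses = [[random.random() for _ in range(n)] for _ in range(T)]   # costs in [0, 1]
eta = math.sqrt(math.log(n) / T)                # illustrative tuning
alg_loss = exponentiated_gradient(losses, eta)
best_loss = min(sum(l[i] for l in losses) for i in range(n))
regret = alg_loss - best_loss                   # should be O(sqrt(T log n)), far below T
```

On this instance the realized regret stays well under the theoretical $O(\sqrt{T \log n})$ envelope, while a player who ignored the losses entirely could do no better than a regret linear in $T$.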

A The Negative Entropy Function

In this section we provide calculations that show the properties relevant to using negative entropy as the regularizer.

Theorem 6 Let $R(x) = \sum_i x(i) \log x(i)$. We have the following:

$$B_R(x \| y) = \sum_i \Big( x(i) \log \frac{x(i)}{y(i)} - x(i) + y(i) \Big)$$

Proof: The calculation follows from the definition. Note that $\nabla R(x) = (\log x(i) + 1)_i$:

$$B_R(x \| y) = R(x) - R(y) - \nabla R(y) \cdot (x - y)$$
$$= \sum_i x(i) \log x(i) - \sum_i y(i) \log y(i) - \sum_i \big( \log y(i) + 1 \big) \big( x(i) - y(i) \big)$$
$$= \sum_i \big( x(i) \log x(i) - x(i) \log y(i) - x(i) + y(i) \big)$$
$$= \sum_i \Big( x(i) \log \frac{x(i)}{y(i)} - x(i) + y(i) \Big)$$

The theorem holds. Noticeably, on the simplex (where $\sum_i x(i) = \sum_i y(i) = 1$), the Bregman divergence of negative entropy is simply the KL divergence. Given this formulation we prove the following.

Theorem 7 Let $R(x) = \sum_i x(i) \log x(i)$. Then $B_R(x \| y)$ subject to $x \in \Delta_n$ is minimized at the point

$$x^* = \frac{y}{\|y\|_1}$$

Proof: We wish to minimize the following expression with respect to $x$, subject to $\sum_i x(i) = 1$:

$$x^* = \operatorname{argmin}_{x \in \Delta_n} \sum_i \Big( x(i) \log \frac{x(i)}{y(i)} - x(i) + y(i) \Big)$$

This is easily done using Lagrange multipliers. Let $F$ be defined as

$$F(x, \lambda) = \sum_i \Big( x(i) \log \frac{x(i)}{y(i)} - x(i) + y(i) \Big) - \lambda \Big( \sum_i x(i) - 1 \Big)$$

Setting $\partial F / \partial x(i) = 0$ gives $\log \frac{x(i)}{y(i)} = \lambda$, so each $x(i)$ is proportional to $y(i)$; the constraint $\sum_i x(i) = 1$ then forces

$$x(i) = \frac{y(i)}{\|y\|_1}$$

Substituting in gives us the theorem. This yields the interpretation that the Bregman projection with respect to negative entropy onto the $n$-dimensional simplex is scaling by the $\ell_1$-norm.
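Theorems 6 and 7 can be sanity-checked numerically. The sketch below is ours (names are illustrative): it evaluates the closed form from Theorem 6 and verifies that no point on a coarse grid over the simplex achieves a smaller divergence to $y$ than the scaled point $y / \|y\|_1$ from Theorem 7.

```python
import math

def bregman_neg_entropy(x, y):
    """Closed form from Theorem 6: sum_i ( x(i) log(x(i)/y(i)) - x(i) + y(i) )."""
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

y = [0.3, 0.9, 0.6]                  # arbitrary nonnegative point, not on the simplex
x_star = [v / sum(y) for v in y]     # claimed minimizer over the simplex (Theorem 7)
val_star = bregman_neg_entropy(x_star, y)
# Since x_star(i)/y(i) = 1/1.8 for every i, val_star = 0.8 - ln(1.8) exactly.

# Compare against a coarse grid of simplex points (a, b, 1 - a - b).
grid = [0.05 * k for k in range(1, 19)]
grid_min = min(
    bregman_neg_entropy([a, b, 1.0 - a - b], y)
    for a in grid for b in grid if a + b < 0.999
)
```

The grid minimum can never dip below `val_star`, matching the Lagrange-multiplier argument above.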

B Pinsker's Inequality

In this section, we prove Pinsker's inequality, which gives us the fact that negative entropy is $\alpha$-strongly convex with respect to the $\ell_1$-norm with $\alpha = \frac{1}{2 \ln 2}$.

Theorem 8 Let $P$ and $Q$ be two distributions defined on the sample space $\Omega$. Then the following holds:

$$D_{KL}(P \| Q) \ge \frac{1}{2 \ln 2} \|P - Q\|_1^2$$

Proof: We first show the theorem holds for the case where $P$ and $Q$ are Bernoulli distributions. Let $p, q \in [0, 1]$ and $P$, $Q$ be given by

$$P = \begin{cases} 1 & \text{w.p. } p \\ 0 & \text{w.p. } 1 - p \end{cases} \qquad Q = \begin{cases} 1 & \text{w.p. } q \\ 0 & \text{w.p. } 1 - q \end{cases}$$

Without loss of generality, let $p \ge q$, and note that $\|P - Q\|_1 = 2|p - q|$. Define $f$ to be

$$f(p, q) = D_{KL}(P \| Q) - \frac{1}{2 \ln 2} \|P - Q\|_1^2 = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} - \frac{4 (p - q)^2}{2 \ln 2}$$

Observe that $f(p, q) = 0$ when $p = q$. Furthermore, the following holds:

$$\frac{\partial f}{\partial q} = \frac{p - q}{\ln 2} \left( 4 - \frac{1}{q (1 - q)} \right)$$

Since $q(1 - q) \le \frac{1}{4}$, the second factor is nonpositive, so $\partial f / \partial q \le 0$ for $q \le p$: decreasing $q$ below $p$ only increases $f$ from $f(p, p) = 0$, hence $f(p, q) \ge 0$. We conclude that $D_{KL}(P \| Q) \ge \frac{1}{2 \ln 2} \|P - Q\|_1^2$ in the Bernoulli case.

Now consider the case where $P$ and $Q$ are distributed arbitrarily on $\Omega$. Let $A \subseteq \Omega$ be given by $A = \{ x : P(x) \ge Q(x) \}$, and define the following Bernoulli random variables:

$$P_A = \begin{cases} 1 & \text{w.p. } \sum_{x \in A} P(x) \\ 0 & \text{w.p. } \sum_{x \notin A} P(x) \end{cases} \qquad Q_A = \begin{cases} 1 & \text{w.p. } \sum_{x \in A} Q(x) \\ 0 & \text{w.p. } \sum_{x \notin A} Q(x) \end{cases}$$

We then have the following:

$$\|P - Q\|_1 = \sum_{x \in \Omega} |P(x) - Q(x)| = \sum_{x \in A} \big( P(x) - Q(x) \big) + \sum_{x \notin A} \big( Q(x) - P(x) \big) = \Big| \sum_{x \in A} P(x) - \sum_{x \in A} Q(x) \Big| + \Big| \sum_{x \notin A} P(x) - \sum_{x \notin A} Q(x) \Big| = \|P_A - Q_A\|_1$$

Now define the random variable $Z$ by $Z(x) = 1$ if $x \in A$ and $Z(x) = 0$ otherwise. By the chain rule for KL divergence, $D_{KL}(P \| Q) = D_{KL}(P(Z) \| Q(Z)) + D_{KL}(P \| Q \mid Z)$. However, $D_{KL}(P(Z) \| Q(Z)) = D_{KL}(P_A \| Q_A)$ and $D_{KL}(P \| Q \mid Z) \ge 0$, so we must have the following:

$$D_{KL}(P \| Q) \ge D_{KL}(P_A \| Q_A) \ge \frac{1}{2 \ln 2} \|P_A - Q_A\|_1^2 = \frac{1}{2 \ln 2} \|P - Q\|_1^2$$

Thus we complete the proof.
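Theorem 8 is stated with logarithms in base 2, which is where the constant $\frac{1}{2 \ln 2}$ comes from. A quick randomized sanity check of the inequality (ours, not from the notes; all names are illustrative):

```python
import math, random

def kl_bits(p, q):
    """D_KL(P || Q) with log base 2, matching the statement of Theorem 8."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)

def l1(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_distribution(n):
    w = [random.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

random.seed(1)
violations = 0
for _ in range(1000):
    p, q = random_distribution(5), random_distribution(5)
    lhs = kl_bits(p, q)
    rhs = l1(p, q) ** 2 / (2 * math.log(2))     # (1 / (2 ln 2)) * ||P - Q||_1^2
    if lhs < rhs - 1e-9:
        violations += 1
```

No violations should appear over any number of trials; if KL were instead measured in nats, the same check would hold with the constant $\frac{1}{2}$.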