
Statistical Techniques in Robotics (16-831, F10)    Lecture #10 (Thursday, September 23)
General Weighted Majority, Online Learning as Online Optimization
Lecturer: Drew Bagnell    Scribe: Nathaniel Barshay¹

¹ Content adapted from previous scribes: Anca Drăgan, Jared Goerner.

1 Generalized Weighted Majority

1.1 Recap

In general online learning, we cannot hope to make guarantees about loss in any absolute sense. Instead, we use the notion of regret $R$ to compare the loss of our algorithm (alg) against that of the best expert ($e^*$) from some family of experts. We thus define regret as:

$$R = \sum_t l_t(\mathrm{alg}) - \sum_t l_t(e^*) \qquad (1)$$

The first of such algorithms analyzed was Weighted Majority (WM), which works on 0/1 loss and achieves a number of mistakes $m$ bounded by:

$$m \le 2.4\,(m^* + \log N) \qquad (2)$$

where $N$ is the number of experts and $m^*$ is the number of mistakes the best expert in retrospect makes over the entire time horizon. Next we looked at Randomized Weighted Majority (RWM), which is similar to WM but uses a weighted draw (rather than a weighted average) to make a prediction at a given timestep, and also introduces a learning rate $\beta$.

1.2 The Master Learning Algorithm: Generalized Weighted Majority

The RWM algorithm mentioned above assumes a binary loss function: $l \in \{0, 1\}$. We now generalize to any loss function with outputs in $[0, 1]$ (keeping RWM as a special case). The algorithm is:

1. Set $w_0^i = 1$ for every expert $i$.
2. At time $t$, pick an expert $e_i$ in proportion to its weight $w_t^i$, and let that expert decide.
3. Adjust the expert weights: $w_{t+1}^i \leftarrow w_t^i \, e^{-\epsilon\, l_t(e_i)}$.

The bound on the regret of this algorithm becomes

$$E[R] \le \epsilon \sum_t l_t(e^*) + \frac{1}{\epsilon} \ln N \qquad (3)$$
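The three steps above are straightforward to implement. The following is a minimal sketch, assuming the full loss vector of every expert is revealed at each round and a fixed learning rate $\epsilon$; the matrix-based interface and names are illustrative choices, not from the notes.

```python
import numpy as np

def generalized_weighted_majority(loss_matrix, eps=0.1, seed=0):
    """Sketch of GWM: loss_matrix is T x N with entries in [0, 1],
    where loss_matrix[t, i] is the loss of expert i at time t."""
    rng = np.random.default_rng(seed)
    T, N = loss_matrix.shape
    w = np.ones(N)                          # step 1: w_0^i = 1
    incurred = []
    for t in range(T):
        i = rng.choice(N, p=w / w.sum())    # step 2: draw expert i with prob. proportional to w_t^i
        incurred.append(loss_matrix[t, i])
        w *= np.exp(-eps * loss_matrix[t])  # step 3: w_{t+1}^i = w_t^i * exp(-eps * l_t(e_i))
    return np.array(incurred)

# Toy check against the best expert in hindsight.
losses = np.random.default_rng(1).random((1000, 5))
print(generalized_weighted_majority(losses, eps=0.05).sum(), losses.sum(axis=0).min())
```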

Ideally, we would like this algorithm to be what is called No Regret, defined as the average regret over time converging to 0:

$$\frac{R_T}{T} \longrightarrow 0 \quad \text{as } T \to \infty \qquad (4)$$

To do so, we need to make sure that $\epsilon(T)$ decays in such a way that the regret grows less than linearly as a function of $T$. Since $l_t(e^*) \le 1$ by definition, we have that $\sum_t l_t(e^*) \le O(T)$. Therefore, applying this to (3), we get:

$$E[R] \le O\!\left(\epsilon T + \frac{1}{\epsilon} \ln N\right) \qquad (5)$$

Setting $\epsilon = \frac{1}{\sqrt{T}}$, we get that

$$E[R] \le O\!\left(\sqrt{T} + \sqrt{T} \ln N\right) \qquad (6)$$

This grows sublinearly in $T$, thus the ratio in (4) tends towards $\frac{1}{\sqrt{T}}$, and we have shown a no-regret algorithm. Of course, this requires knowing $T$ beforehand; it turns out one can also achieve no regret (via a harder proof) by varying $\epsilon$ with time: $\epsilon_t = \frac{1}{\sqrt{t}}$.

Note: This is the point where Drew says that this algorithm can solve any problem in the universe. For more information on General Weighted Majority, refer to the original paper by Arora et al. [1]. This algorithm has many surprising applications: computational geometry, AdaBoost (where the experts are the data points), and even Yao's XOR lemma (see [2] for more details) in complexity theory.

1.3 Application: Self-Supervised Learning

Suppose we have a robot (driving in 1 dimension) that wants to learn to identify objects at long range, given that it can identify objects perfectly at short range. Such a sensor model is quite common: we have far less information (and thus classification is far more difficult) when objects are far away. It might be desirable to learn such a model in a path planning setting. Let us assume that every observed obstacle is either a Tree or a Giant Carrot (and thus the difficulty of classification is quite understandable). The formal online learning setup is as follows: we get features (from an object at range) and decide on a class (Tree/Carrot) from the features available, then we drive close to the object and the world gives us the true classification. We will use 0/1 loss (0 if correct, 1 if incorrect). Almost any family of classifiers can be used as our set of experts (decision trees, linear classifiers). However, we must discretize the parameters of such learners to keep the number of experts finite.
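This online protocol can be written as a short loop. Below is a minimal sketch assuming a generic finite expert set and 0/1 loss; the expert representation and the `stream` interface (long-range features paired with the ground-truth label obtained up close) are illustrative, not part of the original notes.

```python
import numpy as np

def online_tree_vs_carrot(experts, stream, eps=0.1, seed=0):
    """0/1-loss online learning over a finite expert set, GWM-style.

    experts: list of callables mapping a feature vector to a label in {-1, +1}
             (say, -1 = Tree and +1 = Giant Carrot).
    stream:  iterable of (long_range_features, true_label) pairs, where the
             true label stands in for the perfect short-range classification.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(len(experts))
    mistakes = 0
    for features, true_label in stream:
        i = rng.choice(len(experts), p=w / w.sum())    # draw an expert by weight
        prediction = experts[i](features)              # classify the object at range
        mistakes += int(prediction != true_label)      # "drive close" reveals the truth
        losses = np.array([float(e(features) != true_label) for e in experts])
        w *= np.exp(-eps * losses)                     # GWM update on the 0/1 losses
    return mistakes
```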

1.4 Example: Linear Classifiers

The general form of a binary linear classifier is

$$h_\theta(f) = \begin{cases} +1 & \text{if } \theta^T f \ge 0 \\ -1 & \text{if } \theta^T f < 0 \end{cases}$$

Here, $f$ is a feature vector and $\theta$ is the vector of weights. Note that if we assert $\|\theta\| = 1$, then $\theta$ essentially has only $d - 1$ free parameters, since the last is redundant (and each $\theta_i \in [-1, 1]$). In order to have a finite family of experts, we might discretize each $\theta_i$ into $b$ levels, and have an expert for each combination of the $\theta_i$. In this case we have $N = b^{d-1}$, and we can run GWM verbatim. Plugging the number of experts into (3), we get:

$$E[R] \le \epsilon \sum_t l_t(e^*) + \frac{1}{\epsilon}\, O\!\left(\ln b^{d-1}\right) = \epsilon \sum_t l_t(\theta^*) + \frac{1}{\epsilon}\, O(d \ln b) \qquad (7)$$

Therefore, the regret scales linearly in the dimension of the feature space. It turns out it is generally true that we need $O(n)$ samples for a linear classifier (when introducing the constant, it becomes about $10n$). This is great theoretically, but keep in mind we still need to track weights for each of $O(b^d)$ experts! Thus the algorithm is only practical for small $d$ (upper bound at about 4).
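To make the discretization concrete, here is a minimal sketch of how the finite expert family might be enumerated. It uses the naive $b^d$ grid over $[-1, 1]^d$ rather than the unit-norm $b^{d-1}$ parameterization; the construction and names are illustrative assumptions, not from the notes.

```python
import itertools
import numpy as np

def discretized_linear_experts(d, b):
    """Enumerate linear classifiers whose weight vector theta lies on a
    b-level grid in [-1, 1]^d. Fixing ||theta|| = 1 as in the notes would
    remove one degree of freedom, but the family is still exponential in d."""
    levels = np.linspace(-1.0, 1.0, b)
    experts = []
    for combo in itertools.product(levels, repeat=d):
        theta = np.array(combo)
        # Each expert predicts +1 if theta . f >= 0 and -1 otherwise.
        experts.append(lambda f, theta=theta: 1 if theta @ f >= 0 else -1)
    return experts

print(len(discretized_linear_experts(d=3, b=5)))   # 5**3 = 125 experts
print(5 ** 10)                                     # already 9,765,625 experts at d = 10
```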

2 Online Learning as Online Optimization

Motivation: "In fact, the great watershed in optimization isn't between linearity and nonlinearity, but between convexity and nonconvexity."

Most importantly, if we use:

1. Convex sets of experts
2. Convex loss functions

then we may be able to solve our online learning problem efficiently in the realm of online optimization.

2.1 What are Convex Sets and Functions?

A convex set is a set such that any convex combination of two points in the set is also in the set:

$$\text{if } A \in C,\ B \in C,\ \theta \in [0, 1], \text{ then } \theta A + (1 - \theta) B \in C \qquad (8)$$

For example, the perimeter of a circle is not convex, because a convex combination of two points is a chord, which passes through the interior (which is not in the set). Examples of convex sets include:

- Unit ball in $\mathbb{R}^n$ under the $l_2$ norm: $S = \{x : \|x\|_2 \le R\}$
- Box in $\mathbb{R}^n$: $S = \{x : \|x\|_\infty \le R\}$
- General unit ball: $S = \{x : \|x\| \le 1\}$
- Linear subspace
- Half space: $S = \{x : w^T x \le b\}$
- Intersection of half spaces, i.e. a polyhedron
- Cone, i.e. all positive linear combinations of a set of vectors

Convex functions are the functions for which the epigraph (the area above the curve) is a convex set. Defined rigorously, we have:

$$\theta f(A) + (1 - \theta) f(B) \ge f(\theta A + (1 - \theta) B) \qquad (9)$$

This directly generalizes to Jensen's inequality:

$$\sum_i \theta_i = 1 \;\Longrightarrow\; f(\theta_1 x_1 + \dots + \theta_n x_n) \le \sum_i \theta_i f(x_i)$$

2.2 Subgradients

Convex functions have subgradients at every point in their domain. A vector $\partial f(x)$ is a subgradient at $x$ if it is the normal to some plane that touches $f$ at $x$ and lies below the rest of $f$. In symbols:

$$f(y) \ge f(x) + \partial f(x)^T (y - x) \quad \forall y \qquad (10)$$

If a function is differentiable at a point, then it has a unique subgradient at that point (the gradient). Furthermore, a convex function is the pointwise maximum of the affine functions defined by its subgradients. This is an interesting property that will be used later in the class, because the maximum of convex functions is convex.

Several key properties of convex functions follow:

- Any local minimum is also a global minimum (not necessarily the unique global minimum). This is easy to see: the subgradient plane at a local minimum lower-bounds the function's value everywhere.
- Local optimization never gets stuck (we can always follow a subgradient down, unless we are already at a global minimum).
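As a small numerical check of inequality (10), here is a hedged sketch using the absolute-value function (an illustrative choice, not an example from the notes); at its kink any value in $[-1, 1]$ is a valid subgradient, and $0$ is used here.

```python
import numpy as np

def subgradient_abs(x):
    """One valid subgradient of f(x) = |x|: the derivative away from 0,
    and 0 at the kink (any value in [-1, 1] would also work there)."""
    return np.sign(x)

# Check f(y) >= f(x) + g * (y - x) for all sampled y, at a few anchor points x.
f = np.abs
ys = np.linspace(-5, 5, 101)
for x in [-2.0, 0.0, 1.5]:
    g = subgradient_abs(x)
    assert np.all(f(ys) >= f(x) + g * (ys - x) - 1e-12)
print("subgradient inequality (10) holds at all sampled points")
```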

2.3 The Online Convex Programming Problem - Intro

Online Convex Programming was proposed by Martin Zinkevich [3] in 2003. It is framed in the same context of time steps, loss functions, experts, and weighted majority, preserving all the same qualities from WM while being computationally feasible. The idea is that the experts are elements of some convex set, and that the loss at time $t$ is convex over the set of experts and thus has a subgradient. At every time step, we need to predict an expert $x_t$ and receive the loss $l_t(x_t)$ and a subgradient $\partial l_t(x_t)$.

Example: for the case of the linear classifier, where the experts are $x_t = \theta_t$ in some convex set, the loss function could be

$$l_t(\theta_t) = (\theta_t^T f_t - y_t)^2$$

where $y_t \in \{-1, 1\}$ is the actual label for the data point $f_t$. This loss is convex, and is in fact a parabola in terms of $\theta_t$.

The next lecture will formalize the Online Convex Programming Problem better, and explain its applications to No Regret Portfolio creation.

References

[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: A meta algorithm and its applications. Technical report, Princeton University.
[2] O. Goldreich, N. Nisan, and A. Wigderson. On Yao's XOR-lemma.
[3] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. ICML 2003.