Advanced Machine Learning
Learning with Large Expert Spaces
Mehryar Mohri
Courant Institute & Google Research

Problem
Learning guarantees: $R_T = O(\sqrt{T \log N})$, informative even for very large $N$.
Problem: the computational complexity of the algorithm is in $O(N)$.
Can we derive more efficient algorithms when the experts admit some structure and the loss is decomposable?

Example: Online Shortest Path
Problems: path experts.
- Sending packets along paths of a network with routers (vertices); delays (losses).
- Car route selection in the presence of traffic (losses).

Outline
- RWM with path experts
- FPL with path experts

Path Experts
[Figure: an acyclic directed graph over vertices 0-4 with edges e01, e02, e03, e14, e23, e24, e34; each path from vertex 0 to vertex 4 defines a path expert. A second copy shows the same graph as a finite automaton with labeled edges e01:a, e02:b, e03:a, e14:a, e23:a, e24:b, e34:a.]

Additive Loss
For path $\pi = e_{02}\, e_{23}\, e_{34}$, $l_t(\pi) = l_t(e_{02}) + l_t(e_{23}) + l_t(e_{34})$.
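
To make the decomposition concrete, here is a minimal Python sketch; the edge names match the figure, but the loss values are hypothetical:

```python
# Hypothetical per-edge losses at round t (illustrative values only).
edge_loss = {"e02": 0.3, "e23": 0.1, "e34": 0.2}

# Additive loss of the path expert pi = e02 e23 e34.
path = ["e02", "e23", "e34"]
l_t_pi = sum(edge_loss[e] for e in path)  # = 0.6
```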

RWM + Path Experts
Weight update: at each round, update the weight of each path expert $\pi = e_1 \cdots e_n$:
$w_t[\pi] \leftarrow w_{t-1}[\pi]\, e^{-\eta\, l_t(\pi)}$;
equivalent to the edge-wise update $w_t[e_i] \leftarrow w_{t-1}[e_i]\, e^{-\eta\, l_t(e_i)}$ (Takimoto and Warmuth, 2002).
Sampling: need to make the graph/automaton stochastic.
[Figure: the graph over vertices 0-4, with the updated weight $w_{t-1}[e_{14}]\, e^{-\eta\, l_t(e_{14})}$ shown on edge e14.]
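
A minimal sketch of the edge-wise update in Python, assuming additive losses; the learning rate eta and the loss values are illustrative, not prescribed by the slides:

```python
import math

def rwm_update(edge_weights, edge_losses, eta):
    """Multiply each edge weight by exp(-eta * l_t(e)).

    Since l_t(pi) = sum_i l_t(e_i), this implicitly multiplies the weight of
    every path expert pi = e_1 ... e_n by exp(-eta * l_t(pi)).
    """
    return {e: w * math.exp(-eta * edge_losses.get(e, 0.0))
            for e, w in edge_weights.items()}

# Example round on the graph of the previous slides.
w = {e: 1.0 for e in ["e01", "e02", "e03", "e14", "e23", "e24", "e34"]}
w = rwm_update(w, {"e02": 0.3, "e23": 0.1, "e34": 0.2}, eta=0.5)
```

Sampling a path expert from the updated weights requires normalizing the graph into a stochastic automaton, which is exactly what weight pushing (next slide) provides.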

Weight Pushing Algorithm
(MM 1997; MM, 2009)
Weighted directed graph $G = (Q, E, w)$ with a set of initial vertices $I \subseteq Q$ and final vertices $F \subseteq Q$:
- for any $q \in Q$, $d[q] = \sum_{\pi \in P(q,F)} w[\pi]$;
- for any $e \in E$ with $d[\mathrm{orig}(e)] \neq 0$, $w[e] \leftarrow d[\mathrm{orig}(e)]^{-1}\, w[e]\, d[\mathrm{dest}(e)]$;
- for any $q \in I$, initial weight $\lambda(q) \leftarrow d(q)$.
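
A sketch of weight pushing in the probability semiring $(+, \times)$ for an acyclic graph, assuming the vertices are given in topological order; the graph encoding is an assumption made for illustration:

```python
def weight_push(edges, topo_order, finals):
    """edges: list of (orig, dest, weight); finals: set of final vertices.

    Returns the pushed edge weights and d, where d[q] is the total weight of
    all paths from q to F. After pushing, the outgoing weights at each vertex
    q with d[q] != 0 sum to 1, and the initial weight is lambda(q) = d[q].
    """
    d = {q: (1.0 if q in finals else 0.0) for q in topo_order}
    for q in reversed(topo_order):   # d[q] += sum over edges e leaving q of w[e] * d[dest(e)]
        for (p, r, w) in edges:
            if p == q:
                d[q] += w * d[r]
    pushed = [(p, r, w * d[r] / d[p]) for (p, r, w) in edges if d[p] != 0]
    return pushed, d
```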

Illustration
[Figure: a weighted automaton before weight pushing, with edges a/0, b/1, c/5, d/0 leaving state 0, e/0, f/1 leaving state 1, e/4, f/5 leaving state 2, and e/1 leaving state 3; and the result after weight pushing in the probability semiring, with initial weight 15 at state 0 and edges a/0, b/(1/15), c/(5/15), d/0, e/0, f/1, e/(4/9), f/(5/9), e/(9/15).]

Properties
Stochasticity: for any $q \in Q$ with $d[q] \neq 0$,
$\sum_{e \in E[q]} w'[e] = \sum_{e \in E[q]} \frac{w[e]\, d[\mathrm{dest}(e)]}{d[q]} = \frac{d[q]}{d[q]} = 1$.
Invariance: path weights are preserved. For any path $\pi = e_1 \cdots e_n$ from $I$ to $F$:
$\lambda(\mathrm{orig}(e_1))\, w'[e_1] \cdots w'[e_n]
= d[\mathrm{orig}(e_1)]\, \frac{w[e_1]\, d[\mathrm{dest}(e_1)]}{d[\mathrm{orig}(e_1)]}\, \frac{w[e_2]\, d[\mathrm{dest}(e_2)]}{d[\mathrm{dest}(e_1)]} \cdots
= w[e_1] \cdots w[e_n]\, d[\mathrm{dest}(e_n)]
= w[e_1] \cdots w[e_n] = w[\pi]$,
where the last step uses $d[\mathrm{dest}(e_n)] = 1$ since $\mathrm{dest}(e_n)$ is a final vertex.

Shortest-Distance Computation
Acyclic case: special instance of a generic single-source shortest-distance algorithm working with an arbitrary queue discipline and any $k$-closed semiring (MM, 2002).
- Linear-time algorithm with the topological order queue discipline: $O(|Q| + |E|)$.

Generic Single-Source SD Algorithm
(MM, 2002)

Gen-Single-Source(G, s)
    for i ← 1 to |Q| do
        d[i] ← r[i] ← 0
    d[s] ← r[s] ← 1
    Q ← {s}
    while Q ≠ ∅ do
        q ← Head(Q); Dequeue(Q)
        r' ← r[q]
        r[q] ← 0
        for each e ∈ E[q] do
            if d[n[e]] ≠ d[n[e]] ⊕ (r' ⊗ w[e]) then
                d[n[e]] ← d[n[e]] ⊕ (r' ⊗ w[e])
                r[n[e]] ← r[n[e]] ⊕ (r' ⊗ w[e])
                if n[e] ∉ Q then
                    Enqueue(Q, n[e])
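
A Python sketch of this algorithm, instantiated here with a FIFO queue discipline (the algorithm allows any discipline) and the probability semiring ($\oplus = +$, $\otimes = \times$, $\bar{0} = 0$, $\bar{1} = 1$); the adjacency encoding, in which every vertex appears as a key, is an assumption:

```python
from collections import deque

def gen_single_source(adj, s, plus=lambda a, b: a + b,
                      times=lambda a, b: a * b, zero=0.0, one=1.0):
    """adj: dict mapping every vertex q to a list of (dest, weight) pairs."""
    d = {q: zero for q in adj}  # d[q]: tentative shortest distance from s to q
    r = {q: zero for q in adj}  # r[q]: weight added to d[q] since q was last dequeued
    d[s] = r[s] = one
    queue, in_queue = deque([s]), {s}
    while queue:
        q = queue.popleft()
        in_queue.discard(q)
        r_q, r[q] = r[q], zero
        for nxt, w in adj[q]:
            delta = times(r_q, w)
            if d[nxt] != plus(d[nxt], delta):
                d[nxt] = plus(d[nxt], delta)
                r[nxt] = plus(r[nxt], delta)
                if nxt not in in_queue:
                    in_queue.add(nxt)
                    queue.append(nxt)
    return d
```

With the tropical semiring (min, +) in place of (+, ×), the same code computes ordinary shortest paths.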

Shortest-Distance Computation
General case: all-pairs shortest-distance algorithm in the $(+, \times)$ semiring; for all pairs $(p, q)$ of vertices,
$d[p, q] = \sum_{\pi \in P(p,q)} w[\pi]$.
- Generalization of the Floyd-Warshall algorithm to non-idempotent semirings (MM, 2002).
- Time complexity in $O(|Q|^3)$, space complexity in $O(|Q|^2)$.
- Alternative: approximation using the generic single-source shortest-distance algorithm (MM, 2002).

Generic All-Pairs SD Algorithm
(MM, 2002)

Gen-All-Pairs(G)
    for i ← 1 to |Q| do
        for j ← 1 to |Q| do
            d[i, j] ← ⊕_{e ∈ E ∩ P(i,j)} w[e]
    for k ← 1 to |Q| do
        for i ← 1 to |Q|, i ≠ k do
            for j ← 1 to |Q|, j ≠ k do
                d[i, j] ← d[i, j] ⊕ (d[i, k] ⊗ d[k, k]* ⊗ d[k, j])
        for i ← 1 to |Q|, i ≠ k do
            d[k, i] ← d[k, k]* ⊗ d[k, i]
            d[i, k] ← d[i, k] ⊗ d[k, k]*
        d[k, k] ← d[k, k]*

In-place version.
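
A sketch in Python over the probability semiring, where the closure needed for self-loops is $w^* = \sum_{n \geq 0} w^n = 1/(1-w)$ for $w < 1$; the encoding (0-indexed vertices, an edge list) is illustrative:

```python
def gen_all_pairs(n, edges):
    """n: number of vertices; edges: list of (i, j, weight) with 0 <= i, j < n."""
    d = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:          # initialize with one-edge path weights
        d[i][j] += w

    def star(w):                   # closure in the probability semiring
        assert w < 1.0, "cycle weight must be < 1 for the closure to converge"
        return 1.0 / (1.0 - w)

    for k in range(n):
        s = star(d[k][k])
        for i in range(n):
            if i == k:
                continue
            for j in range(n):
                if j == k:
                    continue
                d[i][j] += d[i][k] * s * d[k][j]
        for i in range(n):
            if i != k:
                d[k][i] = s * d[k][i]
                d[i][k] = d[i][k] * s
        d[k][k] = s
    return d
```

On an acyclic graph every $d[k][k] = 0$ and $\mathrm{star}(0) = 1$, so the update reduces to the familiar Floyd-Warshall recurrence.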

Learning Guarantee
Theorem: let $N$ be the total number of path experts and $M$ an upper bound on the loss of a path expert. Then, the (expected) regret of RWM is bounded as follows:
$L_T \leq L_T^{\min} + 2M\sqrt{T \log N}$.

Exponentiated Weighted Avg
Computation of the prediction at each round:
$\hat{y}_t = \frac{\sum_{\pi \in P(I,F)} w_t[\pi]\, y_{t,\pi}}{\sum_{\pi \in P(I,F)} w_t[\pi]}$.
Two single-source shortest-distance computations:
- edge weight $w_t[e]\, y_t[e]$ (numerator);
- edge weight $w_t[e]$ (denominator).

FPL + Path Experts
Weight update: at each round, update the weight of each edge $e$:
$w_t[e] \leftarrow w_{t-1}[e] + l_t(e)$.
Prediction: at each round, shortest path after perturbing each edge weight:
$w'_t[e] \leftarrow w_t[e] + p_t(e)$,
where $p_t \sim U([0, 1/\epsilon]^{|E|})$ or $p_t \sim$ Laplacian with density $f(x) = \frac{\epsilon}{2}\, e^{-\epsilon \|x\|_1}$.
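
A sketch of one FPL round over an acyclic graph in Python: perturb the cumulative edge losses, then take the shortest path by dynamic programming in topological order. The uniform perturbation $U([0, 1/\epsilon])$ follows the slide; the graph encoding and parameter names are illustrative assumptions:

```python
import random

def fpl_predict(adj, topo_order, source, finals, cum_loss, eps, rng=random):
    """adj: dict q -> list of (edge_name, dest); cum_loss: cumulative edge losses.

    Returns the path expert (list of edge names) with the smallest perturbed loss.
    """
    pert = {e: rng.uniform(0.0, 1.0 / eps) for e in cum_loss}
    best = {q: (float("inf"), None) for q in topo_order}
    best[source] = (0.0, None)
    for q in topo_order:                       # relax edges in topological order
        cost_q = best[q][0]
        for name, nxt in adj.get(q, []):
            c = cost_q + cum_loss[name] + pert[name]
            if c < best[nxt][0]:
                best[nxt] = (c, (q, name))
    f = min(finals, key=lambda q: best[q][0])  # cheapest final vertex
    path = []                                  # backtrack to recover the path
    while best[f][1] is not None:
        q, name = best[f][1]
        path.append(name)
        f = q
    return list(reversed(path))
```

After the round's losses are revealed, the update is simply cum_loss[e] += l_t(e) for every edge.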

Learning Guarantees
Theorem: assume that the edge losses are in $[0, 1]$. Let $l_{\max}$ be the length of the longest path from $I$ to $F$ and $M$ an upper bound on the loss of a path expert. Then, the (expected) regret of FPL is bounded as follows:
$E[R_T] \leq 2\sqrt{l_{\max}\, M\, |E|\, T} \leq 2\, l_{\max} \sqrt{|E|\, T}$.
The (expected) regret of FPL* is bounded as follows:
$E[R_T] \leq 4\sqrt{L_T^{\min}\, |E|\, l_{\max} (1 + \log |E|)} + 4 |E|\, l_{\max} (1 + \log |E|)
\leq 4\, l_{\max} \sqrt{T\, |E|\, (1 + \log |E|)} + 4 |E|\, l_{\max} (1 + \log |E|)
= O(l_{\max} \sqrt{T\, |E| \log |E|})$.

Proof
For FPL, use the bound of previous lectures with $X_1 = |E|$, $W_1 = l_{\max}$, and $R = M \leq l_{\max}$.
For FPL*, use the bound of the previous lecture with $X_1 = |E|$, $W_1 = l_{\max}$, and $N = |E|$.

Computational Complexity
For an acyclic graph:
- $T$ updates of all edge weights;
- $T$ runs of a linear-time single-source shortest-path algorithm;
- $O(T(|Q| + |E|))$ overall.

Extensions
Component hedge algorithm (Koolen, Warmuth, and Kivinen, 2010):
- optimal regret bound: $R_T = O(M\sqrt{T \log |E|})$;
- special instance of mirror descent.
Non-additive losses (Cortes, Kuznetsov, MM, Warmuth, 2015):
- extensions of RWM and FPL;
- rational and tropical losses.

References
- Corinna Cortes, Vitaly Kuznetsov, and Mehryar Mohri. Ensemble methods for structured prediction. In ICML, 2014.
- Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Manfred K. Warmuth. On-line learning algorithms for path experts with non-additive losses. In COLT, 2015.
- Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- Tim van Erven, Wojciech Kotłowski, and Manfred K. Warmuth. Follow the leader with dropout perturbations. In COLT, 2014.
- Adam T. Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.
- Wouter M. Koolen, Manfred K. Warmuth, and Jyrki Kivinen. Hedging structured concepts. In COLT, pages 93-105, 2010.
- Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. In FOCS, pages 256-261, 1989.
- Mehryar Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2), 1997.
- Mehryar Mohri. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321-350, 2002.
- Mehryar Mohri. Weighted automata algorithms. In Handbook of Weighted Automata, Monographs in Theoretical Computer Science, pages 213-254. Springer, 2009.
- Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. JMLR, 4:773-818, 2003.