On-Line Learning with Path Experts and Non-Additive Losses

On-Line Learning with Path Experts and Non-Additive Losses. Mehryar Mohri, Courant Institute & Google Research. Joint work with Corinna Cortes (Google Research), Vitaly Kuznetsov (Courant Institute), and Manfred Warmuth (UC Santa Cruz).

Structured Prediction. Structured output: Y = Y_1 × ⋯ × Y_l. Loss function: L: Y × Y → R_+, decomposable. Example: Hamming loss, L(y, y') = (1/l) Σ_{k=1}^{l} 1_{y_k ≠ y'_k}. Example: edit-distance loss, L(y, y') = (1/l) d_edit(y_1 ⋯ y_l, y'_1 ⋯ y'_l). page 2
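
For concreteness, a minimal Python sketch of the Hamming loss above (the function name and the list-of-labels encoding are ours, not from the slides); the edit-distance loss simply replaces the count of disagreements with d_edit(·,·)/l, and a sketch of that computation appears later, next to the edit-distance transducer slide.

    def hamming_loss(y, y_prime):
        # (1/l) * number of positions k where y_k != y'_k; sequences of equal length l
        l = len(y)
        return sum(yk != yk_p for yk, yk_p in zip(y, y_prime)) / l

    # POS-tagging example from the next slides: one of five tags disagrees
    print(hamming_loss(["D", "N", "V", "D", "N"], ["D", "N", "N", "D", "N"]))  # 0.2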

Examples Pronunciation modeling. Part-of-speech tagging. Named-entity recognition. Context-free parsing. Dependency parsing. Machine translation. Image segmentation. page 3

Examples: NLP Tasks. Pronunciation: "I have formulated a" → "ay / hh ae v / f ow r m y ax l ey t ih d / ax". POS tagging: "The thief stole a car" → "D N V D N". Context-free parsing / dependency parsing: [Figure: a parse tree (S over NP and VP, with tags D N V D N spanning "The thief stole a car") and a dependency tree with a root arc over the same sentence.] page 4

Examples: Image Segmentation page 5

Ensemble Structured Prediction. Input: labeled sample S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × Y)^m; access to p predictors h_1, …, h_p: X → Y; each expert decomposes: h_j(x) = (h_j^1(x), …, h_j^l(x)); multiple path experts. Problem: how do we learn to combine path predictors to devise a more accurate predictor? page 6

Path Experts page 7

On-Line Formulation (Cortes, Kuznetsov, and MM, 2014). Learner maintains a distribution p_t over path experts. At each round t: the learner receives input x_t; incurs loss E_{h∼p_t}[L(h(x_t), y_t)] = Σ_h p_t(h) L(h(x_t), y_t); updates the distribution: p_t → p_{t+1}. On-line-to-batch conversion and guarantees. page 8

Problem. Learning: regret guarantees for the best algorithms are of the form R_T = O(√(T log N)), informative even for very large N. Problem: the computational complexity of these algorithms is in O(N). Can we derive more efficient algorithms when the experts admit some structure and the loss is decomposable? page 9

This Talk Can we devise efficient on-line learning algorithms for path experts with non-additive losses? Examples: machine translation (BLEU score). computational biology (sequence similarities). speech recognition (edit-distance). page 10

Two Solution Families Extension of Randomized Weighted Majority (RWM) algorithm: rational losses. tropical losses. Extension of Follow-the-Perturbed Leader (FPL) algorithm: rational losses. tropical losses. page 11

Outline Additive loss. Rational loss. Tropical loss. page 12

Randomized Weighted Majority (Littlestone and Warmuth, 1988).
Randomized-Weighted-Majority()
 1  for i ← 1 to N do
 2      w_{1,i} ← 1
 3      p_{1,i} ← 1/N
 4  for t ← 1 to T do
 5      for i ← 1 to N do
 6          w_{t+1,i} ← e^{-η l_{t,i}} w_{t,i}
 7          p_{t+1,i} ← w_{t+1,i} / Σ_{j=1}^{N} w_{t+1,j}
 8  return p_{T+1}
page 13
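
A minimal Python sketch of the pseudocode above, assuming the full vector of losses is revealed each round; the function name, the layout of the losses matrix, and eta as an explicit argument are our choices, not from the slides.

    import math
    import random

    def randomized_weighted_majority(losses, eta):
        # losses[t][i]: loss of expert i at round t, assumed to lie in [0, 1].
        # Returns the final distribution p_{T+1} over the N experts.
        N = len(losses[0])
        w = [1.0] * N
        for loss_t in losses:
            total = sum(w)
            p = [wi / total for wi in w]
            # randomized prediction: draw an expert i with probability p_{t,i}
            _ = random.choices(range(N), weights=p)[0]
            # multiplicative update: w_{t+1,i} = exp(-eta * l_{t,i}) * w_{t,i}
            w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss_t)]
        total = sum(w)
        return [wi / total for wi in w]

    # two experts over three rounds; the second expert incurs smaller losses
    print(randomized_weighted_majority([[1.0, 0.0], [1.0, 0.2], [0.5, 0.0]], eta=0.5))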

Example: Online Shortest-Path Problems. Path experts: sending packets along paths of a network with routers (vertices) and delays (losses); car route selection in the presence of traffic (losses). page 14

Additive Loss. For path π = e_{02} e_{23} e_{34}, l_t(π) = l_t(e_{02}) + l_t(e_{23}) + l_t(e_{34}). [Figure: DAG with vertices 0-4 and edges e_{01}, e_{02}, e_{03}, e_{14}, e_{23}, e_{24}, e_{34}.] page 15

RWM + Path Experts. Weight update: at each round, update the weight of path expert π = e_1 ⋯ e_n: w_t[π] ← w_{t-1}[π] e^{-η l_t(π)}; equivalent to the per-edge update w_t[e_i] ← w_{t-1}[e_i] e^{-η l_t(e_i)} (Takimoto and Warmuth, 2002). [Figure: the DAG of path experts, with the update w_{t-1}[e_{14}] e^{-η l_t(e_{14})} shown on edge e_{14}.] Sampling: need to make the graph/automaton stochastic. page 16
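
A small sketch of why the per-edge update suffices on the DAG of the figure: with an additive loss, multiplying each edge weight by e^{-η l_t(e)} multiplies every path weight by e^{-η l_t(π)}. The dictionary encoding and function names are ours.

    import math

    # edge name -> weight, for the DAG in the figure (all weights start at 1)
    edge_weight = {e: 1.0 for e in ["e01", "e02", "e03", "e14", "e23", "e24", "e34"]}

    def update_edges(edge_loss, eta):
        # per-edge multiplicative update: w_t[e] = w_{t-1}[e] * exp(-eta * l_t(e))
        for e, loss in edge_loss.items():
            edge_weight[e] *= math.exp(-eta * loss)

    def path_weight(path):
        # with an additive loss l_t(pi) = sum_i l_t(e_i), the induced path weight
        # is the product of its edge weights: w_t[pi] = w_{t-1}[pi] * exp(-eta * l_t(pi))
        w = 1.0
        for e in path:
            w *= edge_weight[e]
        return w

    update_edges({"e01": 0.2, "e02": 0.3, "e03": 0.4, "e14": 0.6,
                  "e23": 0.1, "e24": 0.0, "e34": 0.5}, eta=0.5)
    print(path_weight(["e02", "e23", "e34"]))  # equals...
    print(math.exp(-0.5 * (0.3 + 0.1 + 0.5)))  # ...exp(-eta * l_t(pi))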

Weight Pushing Algorithm (MM 1997; MM, 2009). Weighted directed graph G = (Q, E, w) with set of initial vertices I ⊆ Q and final vertices F ⊆ Q: for any q ∈ Q, d[q] = Σ_{π ∈ P(q, F)} w[π]; for any e ∈ E with d[orig(e)] ≠ 0, w[e] ← d[orig(e)]^{-1} w[e] d[dest(e)]; for any q ∈ I, initial weight λ(q) ← d[q]. page 17

Illustration. [Figure: a weighted automaton before weight pushing (arcs a/0, b/1, c/5, d/0 from state 0; e/0, f/1 from state 1; e/4, e/1, f/5 elsewhere) and the result of weight pushing over the (+, ×) semiring: initial weight 0/15 and reweighted arcs such as b/(1/15), c/(5/15), e/(4/9), e/(9/15), f/(5/9).] page 18

Properties. Stochasticity: for any q ∈ Q with d[q] ≠ 0, Σ_{e ∈ E[q]} w'[e] = Σ_{e ∈ E[q]} w[e] d[dest(e)] / d[q] = d[q]/d[q] = 1. Invariance: path weight preserved. Weight of a path π = e_1 ⋯ e_n from I to F: λ(orig(e_1)) w'[e_1] ⋯ w'[e_n] = d[orig(e_1)] (w[e_1] d[dest(e_1)] / d[orig(e_1)]) (w[e_2] d[dest(e_2)] / d[dest(e_1)]) ⋯ = w[e_1] ⋯ w[e_n] d[dest(e_n)] = w[e_1] ⋯ w[e_n] = w[π] (since d[dest(e_n)] = 1 at a final vertex). page 19

Shortest-Distance Computation. Acyclic case: special instance of a generic single-source shortest-distance algorithm working with an arbitrary queue discipline and any k-closed semiring (MM, 2002); linear-time algorithm, in O(|Q| + |E|), with the topological-order queue discipline. page 20
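
A minimal sketch of the acyclic shortest-distance computation over the (+, ×) semiring, followed by the weight-pushing reweighting of the previous slides; the graph encoding (adjacency lists of (destination, weight) pairs) and function names are ours, not OpenFst's API.

    def distances_to_final(adj, order, final):
        # acyclic case: d[q] = total weight of all paths from q to a final vertex,
        # computed in reverse topological order in O(|Q| + |E|)
        d = {q: 0.0 for q in order}
        for q in final:
            d[q] = 1.0
        for q in reversed(order):
            for dest, w in adj.get(q, []):
                d[q] += w * d[dest]
        return d

    def push_weights(adj, d):
        # weight pushing: w'[e] = d[orig(e)]^{-1} * w[e] * d[dest(e)],
        # which makes the outgoing weights at every vertex sum to 1
        pushed = {}
        for q, edges in adj.items():
            if d[q] != 0:
                pushed[q] = [(dest, w * d[dest] / d[q]) for dest, w in edges]
            else:
                pushed[q] = list(edges)
        return pushed

    # hypothetical three-vertex DAG: 0 -> 1 (weight 1), 0 -> 2 (weight 3), 1 -> 2 (weight 2)
    adj = {0: [(1, 1.0), (2, 3.0)], 1: [(2, 2.0)], 2: []}
    d = distances_to_final(adj, order=[0, 1, 2], final={2})
    print(d)                     # {0: 5.0, 1: 2.0, 2: 1.0}
    print(push_weights(adj, d))  # outgoing weights now sum to 1 at vertices 0 and 1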

Outline Additive loss. Rational loss. Tropical loss. page 21

Weighted Transducers. [Figure: weighted transducer over states 0-3 with arcs such as a:b/0.1, b:a/0.2, b:a/0.6, a:b/0.5, a:a/0.4, b:a/0.3 and final state 3/0.1.] T(x, y) = sum of the weights of all accepting paths with input x and output y. T(abb, baa) = 0.1 × 0.2 × 0.3 × 0.1 + 0.5 × 0.3 × 0.6 × 0.1. page 22
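
A brute-force sketch of this definition on a hypothetical tiny transducer (encoding and names ours), assuming no ε-transitions: T(x, y) is obtained by summing the weights, including the final weight, of every accepting path reading x on the input side and y on the output side.

    def transducer_weight(arcs, finals, start, x, y):
        # arcs[state] = list of (input, output, weight, next_state); finals[state] = final weight
        def total(state, i, j):
            w = finals.get(state, 0.0) if i == len(x) and j == len(y) else 0.0
            for (inp, out, wt, nxt) in arcs.get(state, []):
                if i < len(x) and j < len(y) and inp == x[i] and out == y[j]:
                    w += wt * total(nxt, i + 1, j + 1)
            return w
        return total(start, 0, 0)

    # hypothetical two-state transducer: 0 --a:b/0.1--> 1, 0 --a:b/0.3--> 1, state 1 final with weight 0.5
    arcs = {0: [("a", "b", 0.1, 1), ("a", "b", 0.3, 1)]}
    print(transducer_weight(arcs, {1: 0.5}, 0, "a", "b"))  # (0.1 + 0.3) * 0.5 = 0.2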

Weighted Determinization (MM 1997). [Figure: a non-deterministic weighted automaton (arcs a/1, a/2, b/3, c/5, d/6; final state 3/0) and its determinization, with subset states such as (0,0), {(1,0),(2,1)}, (3,0)/0 and arcs a/1, b/3, c/5, d/7.] page 23

Composition. Composition of two weighted transducers T_1 and T_2 (Pereira and Riley, 1997; MM et al., 1996): (T_1 ∘ T_2)(x, y) = ⊕_{z} T_1(x, z) ⊗ T_2(z, y). [Figure: two weighted transducers and their composition, whose states are pairs such as (0, 0), (1, 1), (3, 2), (3, 3)/.42 and whose arc weights are products of the matched arc weights, e.g. a:b/.01, a:a/.02, a:a/.04, b:a/.06, b:a/.08, a:b/.18, a:b/.24.] page 24
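
A minimal sketch of this product construction over the (+, ×) semiring, ignoring ε-transitions (which require a composition filter in practice); the arc encoding and function name are ours, not OpenFst's API.

    def compose(arcs1, arcs2, start1, start2, finals1, finals2):
        # arcs are (source, input, output, weight, destination); no epsilon labels.
        # States of T1 o T2 are pairs (q1, q2); an arc x:z/w1 of T1 matches an
        # arc z:y/w2 of T2 and yields an arc x:y/(w1*w2) in the composition.
        arcs = []
        for (s1, a, z1, w1, d1) in arcs1:
            for (s2, z2, b, w2, d2) in arcs2:
                if z1 == z2:
                    arcs.append(((s1, s2), a, b, w1 * w2, (d1, d2)))
        # only pairs reachable from (start1, start2) matter; trimming is omitted here
        finals = {(f1, f2) for f1 in finals1 for f2 in finals2}
        return (start1, start2), arcs, finals

    # hypothetical one-arc transducers: T1 reads a:b/0.1, T2 reads b:a/0.2
    T1 = [(0, "a", "b", 0.1, 1)]
    T2 = [(0, "b", "a", 0.2, 1)]
    print(compose(T1, T2, 0, 0, {1}, {1}))  # one arc a:a/0.02 from (0, 0) to (1, 1)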

Rational Losses. Definition: a rational kernel K is a kernel computed by a weighted transducer U: K(x, y) = U(x, y). Theorem: any weighted transducer U = T ∘ T^{-1} over the semiring (R_+, +, ×, 0, 1) defines a PDS rational kernel. Definition: rational loss defined by a weighted transducer U over the semiring (R_+, +, ×, 0, 1): ∀x, y ∈ Σ*, L_U(x, y) = -log U(x, y). page 25

Bigram Transducer. Bigram transducer T_bigram defined over the probability semiring: [Figure: transducer with states 0, 1, 2/1; self-loops a:ε/1, b:ε/1 at states 0 and 2; arcs a:a/1, b:b/1 from 0 to 1 and from 1 to 2.] Property: ∀x ∈ Σ*, ∀u ∈ Σ², T_bigram(x, u) = |x|_u. Bigram kernel transducer: U_bigram = T_bigram ∘ T_bigram^{-1}; U_bigram(x, y) = Σ_{u ∈ Σ²} |x|_u |y|_u. page 26
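
A direct sketch of the quantity U_bigram computes, without building the transducer: count the bigrams of x and y and sum the products of their counts (function names are ours).

    from collections import Counter

    def bigram_counts(s):
        # |s|_u for every bigram u occurring in s
        return Counter(s[i:i + 2] for i in range(len(s) - 1))

    def bigram_kernel(x, y):
        # U_bigram(x, y) = sum_u |x|_u * |y|_u
        cx, cy = bigram_counts(x), bigram_counts(y)
        return sum(cx[u] * cy[u] for u in cx)

    print(bigram_kernel("abab", "abb"))  # "ab": 2*1, "ba": 1*0, "bb": 0*1 -> 2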

Path Expert Automata. [Figures: the path-expert DAG over vertices 0-4 with edges e_{01}, …, e_{34}; the prediction automaton A_t at time t, with the same topology and arcs labeled by predicted symbols a, b; and the expert-to-prediction transducer T_t at time t, with arcs such as e_{34}:a pairing edge labels with predictions.] page 27

Questions. Weight update (with a rational loss): how do we compute, for each path expert π, exp(-η Σ_{t=1}^{T} L(π(t), y_t))? Sampling: how can we sample according to the distribution defined by these weights? page 28

η-power Semiring. For any η > 0, the system S = (R_+ ∪ {+∞}, ⊕, ×, 0, 1) where ∀x, y ∈ R_+ ∪ {+∞}, x ⊕ y = (x^{1/η} + y^{1/η})^η. Semiring morphism: Φ: (R_+ ∪ {+∞}, +, ×, 0, 1) → S, x ↦ x^η. page 29
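
A small sketch (names ours) of the two semiring operations and a numerical check of the morphism property Φ(x + y) = Φ(x) ⊕ Φ(y) and Φ(x × y) = Φ(x) × Φ(y) with Φ(x) = x^η.

    eta = 0.5

    def oplus(x, y):
        # x (+) y = (x**(1/eta) + y**(1/eta))**eta
        return (x ** (1 / eta) + y ** (1 / eta)) ** eta

    def otimes(x, y):
        # multiplication in S is the ordinary product, as in (R_+, +, x, 0, 1)
        return x * y

    def phi(x):
        # the morphism x -> x**eta
        return x ** eta

    x, y = 2.0, 3.0
    print(phi(x + y), oplus(phi(x), phi(y)))   # equal: morphism for +
    print(phi(x * y), otimes(phi(x), phi(y)))  # equal: morphism for x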

η-power Semiring WFSTs. Y_t: WFA over S accepting only y_t, with all weights set to 1. T_t: expert-to-prediction WFST over S, with all weights set to 1. Ũ: WFST over S derived from U by changing each weight x into x^η. page 30

Path Weights. Proposition: for the following WFAs over S, V_t = Det(Π(Y_t ∘ Ũ ∘ T_t)) and W_t = V_1 ⊗ ⋯ ⊗ V_t (Π denoting projection onto the path-expert labels); for any t ∈ [1, T] and path expert π, W_t(π) = e^{-η Σ_{s=1}^{t} L_U(y_s, π(s))}. Proof: observe that V_t(π) = (Y_t ∘ Ũ ∘ T_t)(π) = ⊕_{z_1, z_2} Y_t(z_1) ⊗ Ũ(z_1, z_2) ⊗ T_t(z_2, π) = Ũ(y_t, π(t)) = e^{-η L_U(y_t, π(t))}. page 31

Rational Randomized Weighted Majority (RRWM) Algorithm.
RRWM(T)
 1  W_0 ← 1  (deterministic one-state WFA over the semiring S)
 2  for t ← 1 to T do
 3      x_t ← Receive()
 4      T_t ← PathExpertPredictionTransducer(x_t)
 5      V_t ← Det(Π(Y_t ∘ Ũ ∘ T_t))
 6      W_t ← W_{t-1} ⊗ V_t
 7      W_t ← WeightPush(W_t, (+, ×))
 8      ŷ_{t+1} ← Sample(W_t)
 9  return W_T
page 32

Time Complexity. Polynomial-time overall complexity: the worst-case complexity of determinization is exponential, but in this context the complexity is only polynomial; the proof is based on new string-combinatorics arguments. page 33

Regret Guarantee. Theorem: let N be the total number of path experts and M an upper bound on the loss of any path expert. Then the following upper bound holds for the regret of RRWM: E[R_T(RRWM)] ≤ 2M √(T log N). page 34

Outline Additive loss. Rational loss. Tropical loss. page 35

Tropical Losses. Definition: loss defined by a weighted transducer U over the tropical semiring (R ∪ {-∞, +∞}, min, +, +∞, 0): ∀x, y ∈ Σ*, L_U(x, y) = U(x, y). page 36

Edit-Distance Transducer. Edit-distance weighted transducer U_edit defined over the tropical semiring (R ∪ {-∞, +∞}, min, +, +∞, 0). page 37
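
A minimal sketch of the quantity U_edit computes: the edit distance between two strings via the standard dynamic program, which is a shortest-distance computation over the tropical semiring (min plays the role of ⊕ and + the role of ⊗); the function name is ours.

    def edit_distance(x, y):
        # dp[i][j] = minimal number of insertions, deletions and substitutions
        # needed to transform x[:i] into y[:j]
        dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
        for i in range(len(x) + 1):
            dp[i][0] = i
        for j in range(len(y) + 1):
            dp[0][j] = j
        for i in range(1, len(x) + 1):
            for j in range(1, len(y) + 1):
                cost = 0 if x[i - 1] == y[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution / match
        return dp[len(x)][len(y)]

    print(edit_distance("stole", "stale"))  # 1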

Algorithm. Syntactically the same algorithm! The only change is the semiring, from S to ([0, 1], max, ×, 0, 1). page 38

Conclusion On-line learning algorithms for path experts with rational or tropical losses: Rational and Tropical Randomized Weighted Majority. Rational and Tropical Follow-the-Perturbed Leader. Polynomial-time algorithms for rational losses. Applications to MT, ASR, computational biology. Implementation using OpenFst. page 39