On-Line Learning with Path Experts and Non-Additive Losses


1 On-Line Learning with Path Experts and Non-Additive Losses

Joint work with Corinna Cortes (Google Research), Vitaly Kuznetsov (Courant Institute), and Manfred Warmuth (UC Santa Cruz).

Mehryar Mohri, Courant Institute & Google Research.

2 Structured Prediction

Structured output: $Y = Y_1 \times \cdots \times Y_l$.

Loss function: $L \colon Y \times Y \to \mathbb{R}_+$, decomposable.

Example: Hamming loss,
$L(y, y') = \frac{1}{l} \sum_{k=1}^{l} 1_{y_k \neq y'_k}$.

Example: edit-distance loss,
$L(y, y') = \frac{1}{l}\, d_{\text{edit}}(y_1 \cdots y_l,\, y'_1 \cdots y'_l)$.
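A minimal sketch (not from the slides) of the first example: the normalized Hamming loss is additive, i.e., a sum of per-position losses, which is exactly the property the edit-distance loss lacks (a dynamic-programming sketch of the latter appears with slide 37).

```python
# Normalized Hamming loss on equal-length sequences;
# note it decomposes as a sum over positions k.
def hamming_loss(y, y_prime):
    assert len(y) == len(y_prime)
    return sum(a != b for a, b in zip(y, y_prime)) / len(y)

print(hamming_loss("DNVDN", "DNVVN"))  # one mismatch out of five: 0.2
```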

3 Examples

- Pronunciation modeling.
- Part-of-speech tagging.
- Named-entity recognition.
- Context-free parsing.
- Dependency parsing.
- Machine translation.
- Image segmentation.

4 Examples: NLP Tasks

Pronunciation: "I have formulated a" → "ay hh ae v f ow r m y ax l ey t ih d ax".

POS tagging: "The thief stole a car" → "D N V D N".

Context-free parsing/dependency parsing: [figure: a constituency tree (S, NP, VP over D N V D N) and a dependency tree, both for "The thief stole a car"].

5 Examples: Image Segmentation

[Figure: image segmentation examples; not reproducible in text.]

6 Ensemble Structured Prediction

Input:
- labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$;
- access to $p$ predictors $h_1, \ldots, h_p \colon X \to Y$;
- each expert decomposes: $h_j(x) = (h_j^1(x), \ldots, h_j^l(x))$;
- multiple path experts (see the sketch after this list).

Problem: how do we learn to combine path experts to devise a more accurate predictor?
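A hypothetical toy sketch (my own, not from the talk) of where the path experts come from: a path expert selects, for each of the $l$ output positions, which of the $p$ experts to follow, so there are $p^l$ path experts in the full ensemble.

```python
# Enumerate all path experts induced by p position-wise experts.
from itertools import product

def path_expert_predictions(expert_outputs):
    """expert_outputs: list of p predictions, each a length-l sequence.
    Yields (selector, prediction) for every path expert, where
    selector[k] is the index of the expert used at position k."""
    p = len(expert_outputs)
    l = len(expert_outputs[0])
    for selector in product(range(p), repeat=l):
        yield selector, "".join(expert_outputs[j][k]
                                for k, j in enumerate(selector))

# Two POS taggers on "The thief stole a car":
h1, h2 = "DNVDN", "DNNDN"
print(sum(1 for _ in path_expert_predictions([h1, h2])))  # 2**5 = 32
```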

7 Path Experts

[Figure: a directed acyclic graph whose source-to-sink paths are the path experts; not reproducible in text.]

8 On-Line Formulation

(Cortes, Kuznetsov, and MM, 2014)

Learner maintains a distribution $p_t$ over path experts. At each round $t$:
- the learner receives input $x_t$;
- incurs loss $\mathbb{E}_{h \sim p_t}[L(h(x_t), y_t)] = \sum_h p_t(h)\, L(h(x_t), y_t)$;
- updates the distribution: $p_t \to p_{t+1}$.

On-line-to-batch conversion and guarantees.

9 Problem

Learning: regret guarantees for the best algorithms, of the form $R_T = O(\sqrt{T \log N})$; informative even for $N$ very large.

Problem: computational complexity of the algorithm in $O(N)$. Can we derive more efficient algorithms when the experts admit some structure and when the loss is decomposable?

10 This Talk

Can we devise efficient on-line learning algorithms for path experts with non-additive losses?

Examples:
- machine translation (BLEU score);
- computational biology (sequence similarities);
- speech recognition (edit-distance).

11 Two Solution Families

Extension of the Randomized Weighted Majority (RWM) algorithm:
- rational losses;
- tropical losses.

Extension of the Follow-the-Perturbed-Leader (FPL) algorithm:
- rational losses;
- tropical losses.

12 Outline

- Additive loss.
- Rational loss.
- Tropical loss.

13 Randomized Weighted Majority

(Littlestone and Warmuth, 1988)

Randomized-Weighted-Majority(N)
 1  for i ← 1 to N do
 2      w_{1,i} ← 1
 3      p_{1,i} ← 1/N
 4  for t ← 1 to T do
 5      for i ← 1 to N do
 6          w_{t+1,i} ← e^{−η l_{t,i}} w_{t,i}
 7          p_{t+1,i} ← w_{t+1,i} / Σ_{j=1}^{N} w_{t+1,j}
 8  return p_{T+1}
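A short sketch of the pseudocode above in Python; the interface (losses in $[0,1]$ supplied round by round by the caller) is my own choice.

```python
import math

def rwm(loss_rounds, n_experts, eta):
    """loss_rounds: iterable of length-N loss vectors, one per round.
    Returns the final distribution over experts."""
    w = [1.0] * n_experts
    p = [1.0 / n_experts] * n_experts
    for losses in loss_rounds:
        # Multiplicative update, then renormalization.
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
        z = sum(w)
        p = [wi / z for wi in w]
    return p

# Expert 0 is consistently better, so its probability grows:
rounds = [[0.1, 0.9] for _ in range(50)]
print(rwm(rounds, n_experts=2, eta=0.5))
```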

14 Example: Online Shortest Path Problems

Path experts:
- sending packets along paths of a network with routers (vertices) and delays (losses);
- car route selection in the presence of traffic (losses).

15 Additive Loss

For a path $\pi = e_{02}\, e_{23}\, e_{34}$,
$l_t(\pi) = l_t(e_{02}) + l_t(e_{23}) + l_t(e_{34})$.

[Figure: directed graph over vertices 0–4 with edges $e_{01}, e_{02}, e_{03}, e_{14}, e_{23}, e_{24}, e_{34}$.]

16 RWM + Path Experts

Weight update: at each round, update the weight of each path expert $\pi = e_1 \cdots e_n$:
$w_t[\pi] \leftarrow w_{t-1}[\pi]\, e^{-\eta\, l_t(\pi)}$;
equivalent to the edge-wise update (Takimoto and Warmuth, 2002):
$w_t[e_i] \leftarrow w_{t-1}[e_i]\, e^{-\eta\, l_t(e_i)}$.

Sampling: need to make the graph/automaton stochastic.

[Figure: the path graph with the update $w_{t-1}[e_{14}]\, e^{-\eta l_t(e_{14})}$ shown on edge $e_{14}$.]

17 Weight Pushing Algorithm

(MM 1997; MM, 2009)

Weighted directed graph $G = (Q, E, w)$ with a set of initial vertices $I \subseteq Q$ and final vertices $F \subseteq Q$:

- for any $q \in Q$, $d[q] = \sum_{\pi \in P(q, F)} w[\pi]$;
- for any $e \in E$ with $d[\mathrm{orig}(e)] \neq 0$, $w[e] \leftarrow \frac{1}{d[\mathrm{orig}(e)]}\, w[e]\, d[\mathrm{dest}(e)]$;
- for any $q \in I$, initial weight $\lambda(q) \leftarrow d[q]$.
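A minimal sketch of weight pushing for an acyclic graph in the $(+, \times)$ semiring, following the definitions above; the graph encoding (edge lists with mutable weights) is my own choice.

```python
def weight_push(edges, topo_order, initial, final):
    """edges: list of [orig, dest, weight] (weights mutated in place);
    topo_order: vertices in topological order (graph assumed acyclic).
    Returns (d, lam): path-sum distances and initial weights."""
    out = {q: [] for q in topo_order}
    for e in edges:
        out[e[0]].append(e)
    # d[q]: total weight of all paths from q to F,
    # computed in reverse topological order.
    d = {}
    for q in reversed(topo_order):
        d[q] = (1.0 if q in final else 0.0) + \
               sum(w * d[dest] for _, dest, w in out[q])
    # Reweight each edge: w[e] <- w[e] * d[dest(e)] / d[orig(e)].
    for e in edges:
        if d[e[0]] != 0:
            e[2] = e[2] * d[e[1]] / d[e[0]]
    lam = {q: d[q] for q in initial}
    return d, lam

# Outgoing weights at every vertex with d[q] != 0 now sum to 1:
edges = [["0", "1", 1.0], ["0", "2", 2.0], ["1", "3", 3.0], ["2", "3", 1.0]]
d, lam = weight_push(edges, ["0", "1", "2", "3"], {"0"}, {"3"})
print(edges, lam)  # edges 0->1 and 0->2 become 0.6 and 0.4; lam["0"] = 5.0
```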

18 Illustration

[Figure: a weighted automaton before and after weight pushing. The total path weight $d[0] = 15$ moves to the initial state (0/15) and the arc weights are renormalized, e.g., $b/1 \to b/(1/15)$, $c/5 \to c/(5/15)$, $e/4 \to e/(4/9)$, $f/5 \to f/(5/9)$.]

19 Properties

Stochasticity: for any $q \in Q$ with $d[q] \neq 0$,
$\sum_{e \in E[q]} w'[e] = \sum_{e \in E[q]} \frac{w[e]\, d[\mathrm{dest}(e)]}{d[q]} = \frac{d[q]}{d[q]} = 1$.

Invariance: path weight preserved. Weight of a path $\pi = e_1 \cdots e_n$ from $I$ to $F$:
$$\lambda(\mathrm{orig}(e_1))\, w'[e_1] \cdots w'[e_n]
= d[\mathrm{orig}(e_1)]\, \frac{w[e_1]\, d[\mathrm{dest}(e_1)]}{d[\mathrm{orig}(e_1)]}\, \frac{w[e_2]\, d[\mathrm{dest}(e_2)]}{d[\mathrm{dest}(e_1)]} \cdots
= w[e_1] \cdots w[e_n]\, d[\mathrm{dest}(e_n)]
= w[e_1] \cdots w[e_n] = w[\pi],$$
the telescoping products canceling and $d[\mathrm{dest}(e_n)] = 1$ at a final vertex.

20 Shortest-Distance Computation

Acyclic case: a special instance of a generic single-source shortest-distance algorithm working with an arbitrary queue discipline and any $k$-closed semiring (MM, 2002).

Linear-time $O(|Q| + |E|)$ algorithm with the topological-order queue discipline.
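A sketch of the acyclic case with a pluggable semiring, specialized to the topological-order queue discipline; the two semiring instantiations are standard, the graph is a made-up example.

```python
def shortest_distance(edges, topo_order, source, plus, times, zero, one):
    """Returns d[q] = semiring sum over all paths source -> q."""
    d = {q: zero for q in topo_order}
    d[source] = one
    for q in topo_order:  # process vertices in topological order
        for orig, dest, w in edges:
            if orig == q:
                d[dest] = plus(d[dest], times(d[q], w))
    return d

edges = [("s", "a", 1.0), ("s", "b", 3.0), ("a", "b", 1.0), ("b", "t", 2.0)]
order = ["s", "a", "b", "t"]
# Probability semiring (+, x): total weight of all paths.
print(shortest_distance(edges, order, "s", lambda x, y: x + y,
                        lambda x, y: x * y, 0.0, 1.0))
# Tropical semiring (min, +): conventional shortest path.
print(shortest_distance(edges, order, "s", min,
                        lambda x, y: x + y, float("inf"), 0.0))
```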

21 Outline

- Additive loss.
- Rational loss.
- Tropical loss.

22 Weighted Transducers

$T(x, y)$ = sum of the weights of all accepting paths with input $x$ and output $y$.

[Figure: a four-state weighted transducer with arcs such as a:b/0.1, b:a/0.2, b:a/0.6, a:b/0.5, a:a/0.4, b:a/0.3 and final state 3/0.1; the slide evaluates $T(abb, baa)$ on it.]

23 Weighted Determinization

(MM 1997)

[Figure: a nondeterministic weighted automaton with arcs $a/1$, $a/2$, $b/3$, $c/5$, $d/6$ and its determinized equivalent, whose states are weighted subsets such as $\{(1,0),(2,1)\}$ and whose arcs are $a/1$, $c/5$, $d/7$.]

24 Composition

(Pereira and Riley, 1997; MM et al. 1996)

Composition of two weighted transducers $T_1$ and $T_2$:
$(T_1 \circ T_2)(x, y) = \bigoplus_{z \in \Sigma^*} T_1(x, z) \otimes T_2(z, y)$.

[Figure: two small weighted transducers and their composition; composed states are pairs such as $(0,0), (1,1), (3,2)$, and arc weights multiply, e.g., $a{:}a/.04$, $a{:}b/.18$.]
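A toy sketch of the composition formula on transducers represented extensionally as `{(input, output): weight}` maps in the $(+, \times)$ semiring; real WFST composition works on the state graphs instead, but the formula is the same.

```python
def compose(t1, t2):
    """(T1 o T2)(x, y) = sum over shared intermediate strings z
    of T1(x, z) * T2(z, y)."""
    out = {}
    for (x, z1), w1 in t1.items():
        for (z2, y), w2 in t2.items():
            if z1 == z2:
                out[(x, y)] = out.get((x, y), 0.0) + w1 * w2
    return out

T1 = {("ab", "ba"): 0.5, ("ab", "bb"): 0.2}
T2 = {("ba", "aa"): 0.3, ("bb", "aa"): 0.4}
print(compose(T1, T2))  # {('ab', 'aa'): 0.5*0.3 + 0.2*0.4 = 0.23}
```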

25 Rational Losses

Definition: a rational kernel $K$ is a kernel computed by a weighted transducer $U$: $K(x, y) = U(x, y)$.

Theorem: any weighted transducer $U = T \circ T^{-1}$ over the semiring $(\mathbb{R}_+, +, \times, 0, 1)$ defines a PDS rational kernel.

Definition: the rational loss defined by a weighted transducer $U$ over the semiring $(\mathbb{R}_+, +, \times, 0, 1)$:
$\forall x, y \in \Sigma^*,\; L_U(x, y) = -\log U(x, y)$.

26 Bigram Transducer

Bigram transducer $T_{\text{bigram}}$ defined over the probability semiring:

[Figure: a three-state transducer with self-loops a:ε/1, b:ε/1 on states 0 and 2 and arcs a:a/1, b:b/1 from 0 to 1 and from 1 to 2; state 2 is final with weight 1.]

Property: $\forall x \in \Sigma^*, u \in \Sigma^2$, $T_{\text{bigram}}(x, u) = |x|_u$.

Bigram kernel transducer: $U_{\text{bigram}} = T_{\text{bigram}} \circ T_{\text{bigram}}^{-1}$;
$U_{\text{bigram}}(x, y) = \sum_{u \in \Sigma^2} |x|_u\, |y|_u$.
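A small sketch of the quantity $U_{\text{bigram}}$ computes, evaluated directly by counting bigram occurrences rather than by transducer composition.

```python
from collections import Counter

def bigram_counts(x):
    """|x|_u for every bigram u occurring in x."""
    return Counter(x[i:i + 2] for i in range(len(x) - 1))

def bigram_kernel(x, y):
    """U_bigram(x, y) = sum over bigrams u of |x|_u * |y|_u."""
    cx, cy = bigram_counts(x), bigram_counts(y)
    return sum(cx[u] * cy[u] for u in cx)

print(bigram_kernel("abab", "abb"))  # ab: 2*1, ba: 1*0 -> 2
```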

27 Path Expert Automata

[Figures: three acyclic machines over vertices 0–4: the path expert automaton with edges $e_{01}, e_{02}, e_{03}, e_{14}, e_{23}, e_{24}, e_{34}$; the prediction automaton $A_t$ at time $t$, with the same topology and predicted labels (a, b) on the edges; and the expert-to-prediction transducer $T_t$ at time $t$, with arcs such as $e_{01}{:}a$, $e_{02}{:}b$, $e_{34}{:}a$.]

28 Questions

Weight update (with a rational loss): how to compute, for each path expert $\pi$,
$\exp\Big({-\eta} \sum_{t=1}^{T} L(\pi(t), y_t)\Big)$?

Sampling: how can we sample according to the distribution defined by these weights?

29 η-power Semiring

For any $\eta > 0$, the system $S_\eta = (\mathbb{R}_+ \cup \{+\infty\}, \oplus_\eta, \times, 0, 1)$ where
$\forall x, y \in \mathbb{R}_+ \cup \{+\infty\},\; x \oplus_\eta y = \big(x^{1/\eta} + y^{1/\eta}\big)^\eta$.

Semiring morphism:
$\Phi \colon (\mathbb{R}_+ \cup \{+\infty\}, +, \times, 0, 1) \to S_\eta,\; x \mapsto x^\eta$.
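A numerical check (my own illustration, not from the slides) that $\Phi(x) = x^\eta$ is indeed a morphism: it maps ordinary addition to $\oplus_\eta$ and preserves multiplication.

```python
eta = 0.3

def oplus_eta(x, y):
    """Addition of the eta-power semiring."""
    return (x ** (1 / eta) + y ** (1 / eta)) ** eta

def phi(x):
    return x ** eta

x, y = 2.0, 5.0
print(phi(x + y), oplus_eta(phi(x), phi(y)))  # equal: + is preserved
print(phi(x * y), phi(x) * phi(y))            # equal: x is preserved
```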

30 η-power Semiring WFSTs

- $Y_t$: WFA over $S_\eta$ accepting only $y_t$, with all weights set to 1.
- $T_t$: expert-to-prediction WFST over $S_\eta$ with all weights set to 1.
- $\tilde{U}$: WFST over $S_\eta$ derived from $U$ by changing each weight $x$ into $x^\eta$.

31 Path Weights

Proposition: define the following WFAs over $S_\eta$:
$V_t = \mathrm{Det}\big(\Pi(Y_t \circ \tilde{U} \circ T_t)\big)$ (with $\Pi$ denoting projection) and $W_t = V_1 \cap \cdots \cap V_t$;
then, for any $t \in [1, T]$ and path expert $\pi$,
$W_t(\pi) = e^{-\eta \sum_{s=1}^{t} L_U(y_s, \pi(s))}$.

Proof: observe that
$$V_t(\pi) = (Y_t \circ \tilde{U} \circ T_t)(\pi)
= \bigoplus_{z_1, z_2} Y_t(z_1)\, \tilde{U}(z_1, z_2)\, T_t(z_2, \pi)
= \tilde{U}(y_t, \pi(t))
= e^{-\eta\, L_U(y_t, \pi(t))}.$$

32 Rational Weighted Majority Alg.

RRWM(T)
 1  W_0 ← 1   (deterministic one-state WFA over the semiring S_η)
 2  for t ← 1 to T do
 3      x_t ← Receive()
 4      T_t ← PathExpertPredictionTransducer(x_t)
 5      V_t ← Det(Π(Y_t ∘ Ũ ∘ T_t))
 6      W_t ← W_{t−1} ∩ V_t
 7      W_t ← WeightPush(W_t, (+, ×))
 8      ŷ_{t+1} ← Sample(W_t)
 9  return W_T
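A naive, exponential-time sketch (my own, for intuition only) of what RRWM computes: it maintains $W_t(\pi) = e^{-\eta \sum_s L_U}$ explicitly for every path expert and samples from the normalized weights. The actual algorithm represents $W_t$ compactly as a WFA (lines 5–7 above) and stays polynomial.

```python
import math
import random

def rrwm_naive(path_experts, rounds, loss, eta):
    """path_experts: list of predictors; rounds: list of (x_t, y_t);
    loss(prediction, y): the rational loss L_U. Yields samples."""
    w = {pi: 1.0 for pi in path_experts}
    for x_t, y_t in rounds:
        for pi in path_experts:
            # Multiplicative update with the per-round loss.
            w[pi] *= math.exp(-eta * loss(pi(x_t), y_t))
        z = sum(w.values())  # normalization: the role of weight pushing
        probs = [w[pi] / z for pi in path_experts]
        yield random.choices(path_experts, weights=probs)[0]

experts = [lambda x: x.upper(), lambda x: x.lower()]
rounds = [("Ab", "AB")] * 5
ham = lambda p, y: sum(a != b for a, b in zip(p, y)) / len(y)
for sampled in rrwm_naive(experts, rounds, ham, eta=1.0):
    pass  # `sampled` is the expert drawn for the next round
```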

33 Time Complexity

Polynomial-time overall complexity:
- worst-case complexity of determinization: exponential;
- but the complexity is only polynomial in this context;
- proof based on new string-combinatorics arguments.

34 Regret Guarantee

Theorem: let $N$ be the total number of path experts and $M$ an upper bound on the loss of any path expert. Then, the following upper bound holds for the regret of RRWM:
$\mathbb{E}[R_T(\mathrm{RRWM})] \leq 2M\sqrt{T \log N}$.

35 Outline

- Additive loss.
- Rational loss.
- Tropical loss.

36 Tropical Losses

Definition: the rational loss defined by a weighted transducer $U$ over the tropical semiring $(\mathbb{R} \cup \{-\infty, +\infty\}, \min, +, +\infty, 0)$:
$\forall x, y \in \Sigma^*,\; L_U(x, y) = U(x, y)$.

37 Edit-Distance Transducer

Edit-distance weighted transducer $U_{\text{edit}}$ defined over the tropical semiring $(\mathbb{R} \cup \{-\infty, +\infty\}, \min, +, +\infty, 0)$.

[Figure: the edit-distance transducer with substitution, insertion, and deletion arcs; not reproducible in text.]
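A sketch of the quantity $U_{\text{edit}}$ computes, written as the familiar $(\min, +)$ dynamic program: each cell is a tropical shortest distance over alignment paths. The unit substitution/insertion/deletion costs are my own choice; in general, the transducer's arc weights set them.

```python
def edit_distance(x, y):
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                 # i deletions
    for j in range(1, n + 1):
        d[0][j] = j                 # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(          # tropical "sum" over incoming arcs
                d[i - 1][j] + 1,                            # deletion
                d[i][j - 1] + 1,                            # insertion
                d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))   # (mis)match
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```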

38 Algorithm

Syntactically the same algorithm! Only change of semiring: from $S_\eta$ to $([0, 1], \max, \times, 0, 1)$.

39 Conclusion

On-line learning algorithms for path experts with rational or tropical losses:
- Rational and Tropical Randomized Weighted Majority;
- Rational and Tropical Follow-the-Perturbed-Leader;
- polynomial-time algorithms for rational losses;
- applications to MT, ASR, computational biology;
- implementation using OpenFst.
