Learning Weighted Automata


1 Learning Weighted Automata
Joint work with Borja Balle (Amazon Research).
Mehryar Mohri, Courant Institute & Google Research.

2 Weighted Automata (WFAs)

3 Motivation
Weighted automata (WFAs) are used in:
- image processing (Kari, 1993).
- automatic speech recognition (MM, Pereira, Riley, 1996, 2008).
- speech synthesis (Sproat, 1995; Allauzen, MM, Riley, 2004).
- machine translation (e.g., Iglesias et al., 2011).
- many other NLP tasks (very long list of refs).
- bioinformatics (Durbin et al., 1998).
- optical character recognition (Breuel, 2008).
- model checking (Baier et al., 2009; Aminof et al., 2011).
- machine learning (Cortes, Kuznetsov, MM, Warmuth, 2015).

4 Motivation
Theory: rational power series, extensively studied (Eilenberg, 1993; Salomaa and Soittola, 1978; Kuich and Salomaa, 1986; Berstel and Reutenauer, 1988).
Algorithms (see survey chapter: MM, 2009): rational operations, intersection or composition, epsilon-removal, determinization, minimization, disambiguation.

5 Learning Automata
Classical results:
- passive learning (Gold, 1978; Angluin, 1978; Pitt and Warmuth, 1993).
- active learning (model with membership and equivalence queries) (Angluin, 1987; Bergadano and Varricchio, 1994, 1996, 2000).
Spectral learning:
- algorithms (Hsu et al., 2009; Bailly et al., 2009; Balle and MM, 2012).
- natural language processing (Balle et al., 2014).
- reinforcement learning (Boots et al., 2009; Hamilton et al., 2013).

6 Learning Guarantees
Existing analyses:
- (Hsu et al., 2009; Denis et al., 2016): statistical consistency, finite-sample guarantees in the realizable case.
- (Balle and MM, 2012): algorithm-dependent finite-sample guarantee based on a stability analysis.
- (Kulesza et al., 2014): algorithm-dependent guarantees with a distributional assumption (data drawn from some WFA).
Can we derive general theoretical guarantees for learning WFAs?

7 This Talk
Learning scenario, complexity tools.
Hypothesis sets.
Learning guarantees.

8 Learning Scenario
Training data: sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ drawn i.i.d. according to some distribution $D$.
Problem: find a WFA $A$ in a hypothesis set $H$ with small expected loss
$$L(A) = \mathbb{E}_{(x, y) \sim D}[L(A(x), y)].$$
Note: the problem is not assumed realizable (the distribution need not be generated by a probabilistic WFA).

9 Emp. Rademacher Complexity
Definition: $G$ family of functions mapping from a set $Z$ to $[a, b]$; sample $S = (z_1, \ldots, z_m)$; $\sigma_i$'s (Rademacher variables): independent uniform random variables taking values in $\{-1, +1\}$.
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_\sigma\Big[\sup_{g \in G} \frac{1}{m}\,\boldsymbol{\sigma} \cdot (g(z_1), \ldots, g(z_m))\Big] = \mathbb{E}_\sigma\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i\, g(z_i)\Big],$$
which measures the correlation of $G$ with random noise.

10 Emp. Rademacher Complexity (cont.)
Rademacher complexity of $G$: $\mathfrak{R}_m(G) = \mathbb{E}_{S \sim D^m}\big[\widehat{\mathfrak{R}}_S(G)\big]$.
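To make the definition concrete, here is a minimal Monte Carlo sketch (not from the slides; the finite function class, represented by the hypothetical matrix G_vals, is illustrative) that estimates the empirical Rademacher complexity:

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of a finite
# class G on a sample S; G_vals[j, i] = g_j(z_i), all names illustrative.
def empirical_rademacher(G_vals, n_trials=10000, seed=0):
    rng = np.random.default_rng(seed)
    n_funcs, m = G_vals.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(G_vals @ sigma) / m       # sup_g (1/m) sum_i sigma_i g(z_i)
    return total / n_trials

# Example: two constant functions g(z) = 0 and g(z) = 1 on a sample of size 50.
G_vals = np.vstack([np.zeros(50), np.ones(50)])
print(empirical_rademacher(G_vals))  # E[max(0, mean(sigma))] = O(1/sqrt(m))
```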

11 Rademacher Complexity Bound
Theorem: let $G$ be a family of functions mapping from $Z$ to $[0, 1]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $g \in G$:
$$\mathbb{E}[g(z)] \le \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\mathfrak{R}_m(G) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
$$\mathbb{E}[g(z)] \le \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Proof: apply McDiarmid's inequality to $\Phi(S) = \sup_{g \in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\big)$ (Koltchinskii and Panchenko, 2002; MM et al., 2012).

12 This Talk
Learning scenario, complexity tools.
Hypothesis sets.
Learning guarantees.

13 Learning Automata
Classical formulation: given a sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\Sigma^* \times \{0, 1\})^m$, find the smallest automaton $A$ consistent with the sample:
$$\min_A \|A\|_0 \quad \text{s.t.} \quad \forall i \in [m],\ A(x_i) = y_i.$$
NP-complete problem (Gold, 1978; Angluin, 1978); even polynomial approximation is NP-hard (Pitt and Warmuth, 1993).
Not the right formulation.

14 Analogy: Linear Classifiers
Sparse learning formulation:
$$\min_{w \in \mathbb{R}^N} \|w\|_0 \quad \text{s.t.} \quad Aw = b.$$
Non-convex, NP-hard optimization problem; not the right formulation.
Alternative: relax to a convex norm (e.g., the $l_1$ norm), as in the sketch below.
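As a hedged illustration of this relaxation (the problem instance is synthetic and SciPy is assumed available), the $l_1$ version, basis pursuit, can be solved as a linear program:

```python
import numpy as np
from scipy.optimize import linprog

# Convex relaxation of min ||w||_0 s.t. Aw = b: basis pursuit,
# min ||w||_1 s.t. Aw = b, as an LP with w = w_plus - w_minus >= 0.
rng = np.random.default_rng(0)
n, N = 10, 30
A = rng.standard_normal((n, N))
w_true = np.zeros(N)
w_true[[2, 17]] = [1.5, -2.0]            # 2-sparse ground truth
b = A @ w_true

c = np.ones(2 * N)                       # sum(w_plus) + sum(w_minus) = ||w||_1
A_eq = np.hstack([A, -A])                # A (w_plus - w_minus) = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * N))
w = res.x[:N] - res.x[N:]
print(np.round(w, 3))                    # recovers sparse w_true (w.h.p. for random A)
```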

15 Questions
What is the appropriate norm to use for learning WFAs? Which hypothesis sets should we consider?
- description in terms of the Hankel matrix.
- description in terms of transition matrices.
- description in terms of a function norm.

16 WFA - Definition
A WFA $A$ over a semiring $(\mathbb{S}, \oplus, \otimes, \bar{0}, \bar{1})$ and alphabet $\Sigma$, with a finite set of states $Q_A$, is defined by:
- an initial weight vector $\alpha_A \in \mathbb{S}^{Q_A}$;
- a final weight vector $\beta_A \in \mathbb{S}^{Q_A}$;
- transition weight matrices $A_a \in \mathbb{S}^{Q_A \times Q_A}$, $a \in \Sigma$.
Function defined: for any $x = x_1 \cdots x_k \in \Sigma^*$,
$$A(x) = \alpha_A^\top A_{x_1} \cdots A_{x_k}\, \beta_A.$$
Notation: $A_x = A_{x_1} \cdots A_{x_k}$.
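A minimal sketch (not from the slides, names illustrative) of WFA evaluation over the real semiring $(+, \times)$, computing $A(x) = \alpha_A^\top A_{x_1} \cdots A_{x_k} \beta_A$ by left-multiplying the state vector:

```python
import numpy as np

def wfa_eval(alpha, beta, trans, x):
    """Compute A(x) = alpha^T A_{x_1} ... A_{x_k} beta over the real semiring."""
    v = alpha
    for symbol in x:
        v = trans[symbol].T @ v      # accumulates (alpha^T A_{x_1} ... A_{x_j})^T
    return float(v @ beta)

# Toy 2-state WFA over {a, b} computing A(x) = number of a's in x.
alpha = np.array([1.0, 0.0])
beta = np.array([0.0, 1.0])
trans = {
    "a": np.array([[1.0, 1.0], [0.0, 1.0]]),  # A_a
    "b": np.eye(2),                           # A_b
}
print(wfa_eval(alpha, beta, trans, "aab"))    # alpha^T A_a A_a A_b beta = 2.0
```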

17 WFA - Illustration
[Figure: an example WFA, showing state numbers, initial and final weights, together with its vectors $\alpha_A$, $\beta_A$ and transition matrices $A_a$, $A_b$.]

18 Hankel Matrix
Definition: the Hankel matrix $H_f$ of a function $f: \Sigma^* \to \mathbb{R}$ is the infinite matrix defined by
$$\forall u, v \in \Sigma^*, \quad H_f(u, v) = f(uv).$$
Redundancy: $f(x)$ appears in all entries $(u, v)$ with $x = uv$.
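A small sketch (illustrative) that materializes a finite sub-block of $H_f$ for chosen prefix and suffix sets, using the "count the a's" function of the toy WFA above:

```python
import numpy as np

# Build the finite sub-block of H_f(u, v) = f(uv) indexed by the given
# prefixes u (rows) and suffixes v (columns).
def hankel_block(f, prefixes, suffixes):
    return np.array([[f(u + v) for v in suffixes] for u in prefixes])

f = lambda x: float(x.count("a"))   # f(x) = number of 'a's in x
H = hankel_block(f, ["", "a", "b", "ab"], ["", "a", "b", "ab"])
print(H)
# Redundancy: f("ab") appears at ("", "ab"), ("a", "b"), and ("ab", "").
```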

19 Theorem of Fliess
Theorem (Fliess, 1974): $\mathrm{rank}(H_f) < +\infty$ iff $f$ is rational. In that case, there exists a (minimal) WFA $A$ representing $f$ with $\mathrm{rank}(H_f)$ states.

20 Theorem of Fliess
Theorem (Fliess, 1974): $\mathrm{rank}(H_f) < +\infty$ iff $f$ is rational. In that case, there exists a (minimal) WFA $A$ representing $f$ with $\mathrm{rank}(H_f)$ states.
Proof: for any $u, v \in \Sigma^*$, if $H$ is the Hankel matrix of $A$, then
$$H(u, v) = A(uv) = (\alpha_A^\top A_u)(A_v \beta_A).$$
Thus, $H = P_A S_A^\top$, where $P_A \in \mathbb{R}^{\Sigma^* \times Q_A}$ has rows $\alpha_A^\top A_u$ and $S_A \in \mathbb{R}^{\Sigma^* \times Q_A}$ has rows $(A_v \beta_A)^\top$.
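Continuing the toy example, a sketch (all names illustrative) of the rank factorization $H = P_A S_A^\top$ from the proof, restricted to finite prefix/suffix sets:

```python
import numpy as np

# 2-state "count the a's" WFA from the earlier sketch.
alpha = np.array([1.0, 0.0]); beta = np.array([0.0, 1.0])
trans = {"a": np.array([[1.0, 1.0], [0.0, 1.0]]), "b": np.eye(2)}

def matprod(x):
    M = np.eye(2)
    for a in x:
        M = M @ trans[a]
    return M                                            # A_x = A_{x_1} ... A_{x_k}

prefixes = ["", "a", "b", "aa"]
suffixes = ["", "a", "b"]
P = np.array([alpha @ matprod(u) for u in prefixes])    # rows alpha^T A_u
S = np.array([matprod(v) @ beta for v in suffixes])     # rows (A_v beta)^T
H = P @ S.T                                             # H(u, v) = A(uv)
assert np.allclose(H, [[alpha @ matprod(u + v) @ beta for v in suffixes]
                       for u in prefixes])
print(np.linalg.matrix_rank(H))   # <= 2, the number of states (Fliess)
```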

21 Standardization
(Schützenberger, 1961; Cardon and Crochemore, 1980)

22 Hypothesis Sets
In view of the theorem of Fliess, a natural choice is
$$H_0 = \{A : \mathrm{rank}(H_A) < r\}$$
for some $r < +\infty$. But rank does not define a convex function (it is the equivalent of norm-0 for column vectors). Instead, definitions based on the nuclear norm and, more generally, Schatten $p$-norms:
$$H_p = \{A : \|H_A\|_p < r\}, \quad \text{with} \quad \|H_A\|_p = \Big[\sum_i \sigma_i^p(H_A)\Big]^{\frac{1}{p}}.$$

23 This Talk
Learning scenario, complexity tools.
Hypothesis sets.
Learning guarantees.

24 Schatten Norms
Common choices for $p$:
- $p = 1$: nuclear norm (or trace norm) $\|A\|_1 = \mathrm{Tr}\big[\sqrt{A^\top A}\big]$.
- $p = 2$: Frobenius norm $\|A\|_2 = \sqrt{\mathrm{Tr}[A^\top A]}$.
- $p = +\infty$: spectral norm $\|A\|_{+\infty} = \sqrt{\lambda_{\max}(A^\top A)} = \sigma_{\max}(A)$.
Properties:
- Hölder's inequality: for $p, p^* \ge 1$ with $\frac{1}{p} + \frac{1}{p^*} = 1$, $\langle A, B\rangle \le \|A\|_p \|B\|_{p^*}$.
- von Neumann's trace inequality: $\langle A, B\rangle \le \sum_i \sigma_i(A)\, \sigma_i(B)$.
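A minimal sketch (illustrative) computing Schatten $p$-norms from singular values, covering the three special cases above and spot-checking Hölder's inequality for $(p, p^*) = (1, \infty)$:

```python
import numpy as np

def schatten_norm(A, p):
    s = np.linalg.svd(A, compute_uv=False)  # singular values of A
    if np.isinf(p):
        return s.max()                       # spectral norm
    return (s ** p).sum() ** (1.0 / p)

A = np.array([[3.0, 0.0], [4.0, 5.0]])
print(schatten_norm(A, 1))        # nuclear norm: sum of singular values
print(schatten_norm(A, 2))        # Frobenius norm: equals np.linalg.norm(A)
print(schatten_norm(A, np.inf))   # spectral norm: largest singular value

# Hoelder check: <A, B> <= ||A||_1 ||B||_inf (Frobenius inner product).
B = np.array([[1.0, 2.0], [0.0, 1.0]])
assert np.tensordot(A, B) <= schatten_norm(A, 1) * schatten_norm(B, np.inf)
```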

25 Emp. Rademacher Complexity
By definition of the dual norm (or Hölder's inequality), for a sample $S = (x_1, \ldots, x_m)$ and any decomposition $x_i = u_i v_i$,
$$\widehat{\mathfrak{R}}_S(H_p) = \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{A \in H_p} \sum_{i=1}^m \sigma_i\, e_{u_i}^\top H_A\, e_{v_i}\Big] = \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{\|H_A\|_p \le r} \Big\langle \sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top,\ H_A \Big\rangle\Big] \le \frac{r}{m}\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_{p^*}\Big].$$

26 Rad. Complexity for p = 2
Lemma: $\widehat{\mathfrak{R}}_S(H_2) \le \frac{r}{\sqrt{m}}$.
Proof: since $p^* = 2$ for $p = 2$,
$$\widehat{\mathfrak{R}}_S(H_2) \le \frac{r}{m}\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2\Big] \le \frac{r}{m}\sqrt{\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2^2\Big]} = \frac{r}{m}\sqrt{\sum_{i,j=1}^m \mathbb{E}[\sigma_i \sigma_j]\,\big\langle e_{u_i} e_{v_i}^\top, e_{u_j} e_{v_j}^\top\big\rangle} = \frac{r}{m}\sqrt{\sum_{i=1}^m \big\langle e_{u_i} e_{v_i}^\top, e_{u_i} e_{v_i}^\top\big\rangle} = \frac{r}{\sqrt{m}}.$$

27 Lower Bound
By the Khintchine-Kahane inequality,
$$\frac{r}{m}\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_{p^*}\Big] \ge \frac{r}{\sqrt{2}\,m}\sqrt{\mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2^2\Big]} = \frac{r}{\sqrt{2}\,m}\sqrt{\sum_{i=1}^m \big\langle e_{u_i} e_{v_i}^\top, e_{u_i} e_{v_i}^\top\big\rangle} = \frac{1}{\sqrt{2}}\,\frac{r}{\sqrt{m}}.$$
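A quick numerical sanity check of the key step (illustrative, not from the slides): with $m$ distinct pairs $(u_i, v_i)$, the matrix $\sum_i \sigma_i e_{u_i} e_{v_i}^\top$ has exactly $m$ entries equal to $\pm 1$, so its Frobenius norm is $\sqrt{m}$ for every sign vector, matching the upper bound and, up to the $1/\sqrt{2}$ factor, the lower bound:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
for _ in range(5):
    sigma = rng.choice([-1.0, 1.0], size=m)
    M = np.zeros((m, m))
    M[np.arange(m), np.arange(m)] = sigma   # pairs (u_i, v_i) = (i, i), all distinct
    assert np.isclose(np.linalg.norm(M), np.sqrt(m))  # Frobenius norm
print("Frobenius norm is sqrt(m) =", np.sqrt(m))
```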

28 Generalization Bound
Theorem: assume that the loss $L$ is the $L_p$ loss and is bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, for all $A \in H_2$,
$$L(A) \le \widehat{L}_S(A) + \frac{2\mu_p\, r}{\sqrt{m}} + M\sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
where $\mu_p = p M^{p-1}$.
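As a worked example (with made-up numbers), the bound can be evaluated directly:

```python
import numpy as np

# H_2 generalization bound from the slide:
# L(A) <= L_S(A) + 2 mu_p r / sqrt(m) + M sqrt(log(1/delta) / (2m)),
# with mu_p = p M^(p-1).
def h2_bound(emp_loss, p, M, r, m, delta):
    mu_p = p * M ** (p - 1)
    return emp_loss + 2 * mu_p * r / np.sqrt(m) + M * np.sqrt(np.log(1 / delta) / (2 * m))

# E.g., squared loss (p = 2) bounded by M = 1, norm radius r = 5:
print(h2_bound(emp_loss=0.10, p=2, M=1.0, r=5.0, m=10**5, delta=0.05))  # ~0.17
```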

29 Proof
By Talagrand's contraction lemma,
$$\widehat{\mathfrak{R}}_S\big(\{(x, y) \mapsto L(A(x), y) : A \in H_2\}\big) = \frac{1}{m}\mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, |A(x_i) - y_i|^p\Big] \le \frac{\mu_p}{m}\mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i \big(A(x_i) - y_i\big)\Big] \quad (x \mapsto |x|^p \text{ is } \mu_p\text{-Lipschitz})$$
$$= \frac{\mu_p}{m}\mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, A(x_i)\Big] + \frac{\mu_p}{m}\mathbb{E}_\sigma\Big[\sum_{i=1}^m \sigma_i\, y_i\Big] = \mu_p\, \widehat{\mathfrak{R}}_S(H_2),$$
since $\mathbb{E}_\sigma\big[\sum_i \sigma_i y_i\big] = 0$.

30 Rad. Complexity for p = 1
Lemma:
$$\widehat{\mathfrak{R}}_S(H_1) \le \frac{r}{m}\Big[c_1 \log(2m + 1) + c_2 \sqrt{W_S \log(2m + 1)}\Big],$$
where $W_S = \min_{\text{decomp.}} \max\{U_S, V_S\}$, with
$$U_S = \max_{u \in \Sigma^*} |\{i : u_i = u\}|, \qquad V_S = \max_{v \in \Sigma^*} |\{i : v_i = v\}|.$$
Proof: apply the matrix Bernstein bound with $M = 1$, $d \le 2m$, $V_1 = \sum_i e_{u_i} e_{u_i}^\top$, $V_2 = \sum_i e_{v_i} e_{v_i}^\top$, $\|V_1\|_{op} = U_S$, $\|V_2\|_{op} = V_S$.
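A small sketch (illustrative only) of the combinatorial quantities: $U_S$ and $V_S$ for one fixed decomposition $x_i = u_i v_i$, and a brute-force $W_S$ that minimizes $\max\{U_S, V_S\}$ over all split points (exponential in $m$, for intuition only):

```python
from collections import Counter
from itertools import product

def u_v(decomp):
    # U_S, V_S: maximum multiplicity among the prefixes u_i and suffixes v_i.
    us, vs = zip(*decomp)
    return max(Counter(us).values()), max(Counter(vs).values())

def w_s(sample):
    best = len(sample)   # trivial bound: U_S, V_S <= m
    for cuts in product(*[range(len(x) + 1) for x in sample]):
        decomp = [(x[:c], x[c:]) for x, c in zip(sample, cuts)]
        best = min(best, max(u_v(decomp)))
    return best

sample = ["ab", "aab", "ba", "b"]
print(w_s(sample))   # 1: a decomposition with all-distinct prefixes and suffixes exists
```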

31 Matrix Bernstein Bound
Corollary: let $M = \sum_i M_i$ be a finite sum of i.i.d. random matrices with $\mathbb{E}[M] = 0$ and $\|M_i\|_{op} \le M$ for all $i$; $\sum_i \mathbb{E}[M_i M_i^\top] \preceq V_1$ and $\sum_i \mathbb{E}[M_i^\top M_i] \preceq V_2$. Then,
$$\mathbb{E}\big[\|M\|_{op}\big] \le c_1 M \log(d + 1) + c_2 \sqrt{\nu \log(d + 1)},$$
where $c_1 = \frac{2 + 8/\log(2)}{3}$, $c_2 = \sqrt{2} + 4/\sqrt{\log(2)}$; $V = \mathrm{diag}(V_1, V_2)$, $\nu = \|V\|_{op}$, $d = \frac{\mathrm{Tr}(V)}{\|V\|_{op}}$ (Minsker, 2011; Tropp, 2015).

32 Generalization Bound
Theorem: assume that the loss $L$ is the $L_p$ loss and is bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, for all $A \in H_1$,
$$L(A) \le \widehat{L}_S(A) + \frac{2\mu_p c_1 r \log(2m + 1)}{m} + \frac{2\mu_p c_2 r \sqrt{W_S \log(2m + 1)}}{m} + 3M\sqrt{\frac{\log\frac{2}{\delta}}{2m}},$$
where $c_1 = \frac{2 + 8/\log(2)}{3}$, $c_2 = \sqrt{2} + 4/\sqrt{\log(2)}$, and $\mu_p = p M^{p-1}$.

33 Conclusion
Theory of learning WFAs:
- data-dependent learning guarantees.
- can help guide the design of algorithms.
- key role of the notion of Hankel matrix (spectral methods; Hsu et al., 2009; Balle and MM, 2012).
- data-dependent combinatorial quantities (e.g., $W_S$).

34 Questions
- Can we use the learning bounds (e.g., $W_S$) to select the prefixes/suffixes defining sub-blocks of the Hankel matrix?
- Can we derive learning guarantees for more general algorithms than (Balle and MM, 2012)?
- Computational challenges.

35 Hypothesis Sets
Definition based on the matrix representation:
$$\mathcal{A}_{n,p,r} = \Big\{A : |Q_A| = n,\ \|\alpha\|_p \le r_\alpha,\ \|\beta\|_p \le r_\beta,\ \max_{a} \|A_a\|_p \le r\Big\}.$$
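A sketch of membership checking for this class, with illustrative norm bounds; whether $\|A_a\|_p$ denotes the Schatten norm of the transition matrices is an assumption of this sketch:

```python
import numpy as np

def in_class(alpha, beta, trans, n, p, r_alpha, r_beta, r):
    # Vector p-norms for alpha, beta; Schatten p-norm for each A_a (assumed).
    schatten = lambda M: np.linalg.norm(np.linalg.svd(M, compute_uv=False), p)
    return (len(alpha) == n
            and np.linalg.norm(alpha, p) <= r_alpha
            and np.linalg.norm(beta, p) <= r_beta
            and max(schatten(A) for A in trans.values()) <= r)

# Toy 2-state WFA from the earlier sketches.
alpha = np.array([1.0, 0.0]); beta = np.array([0.0, 1.0])
trans = {"a": np.array([[1.0, 1.0], [0.0, 1.0]]), "b": np.eye(2)}
print(in_class(alpha, beta, trans, n=2, p=2, r_alpha=1.0, r_beta=1.0, r=2.0))  # True
```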

36 Rademacher Complexities
Corollary: let $L_S = \max_i |x_i|$ and $L_m = \mathbb{E}_{S \sim D^m}[L_S]$. Then,
$$\widehat{\mathfrak{R}}_S(\mathcal{A}_{n,p,r}) \le 6\Big(C + \sqrt{\log(L_S + 2)}\Big)\sqrt{\frac{n(n+2)\, r\tilde{r}}{m}},$$
$$\mathfrak{R}_m(\mathcal{A}_{n,p,r}) \le 6\Big(C + \sqrt{\log(L_m + 2)}\Big)\sqrt{\frac{n(n+2)\, r\tilde{r}}{m}},$$
where $\tilde{r} = \max\{r_\alpha/r,\ r_\beta/r,\ \sqrt{r_\alpha r_\beta}/r\}$ and $C$ is an explicit constant collecting $\sqrt{\log_+(\cdot)}$ terms in $\tilde{r}$, $r_\alpha$, and $r_\beta$, plus $3\sqrt{\log(2)}$.
