Structured Prediction Theory and Algorithms

1 Structured Prediction Theory and Algorithms
Mehryar Mohri, Courant Institute & Google Research.
Joint work with Corinna Cortes (Google Research), Vitaly Kuznetsov (Google Research), and Scott Yang (Courant Institute).

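2 Structured Prediction
- Structured output: $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_l$.
- Loss function: $L\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, decomposable.
- Example: Hamming loss, $L(y, y') = \frac{1}{l} \sum_{k=1}^{l} 1_{y_k \neq y'_k}$.
- Example: edit-distance loss, $L(y, y') = \frac{1}{l}\, d_{\mathrm{edit}}(y_1 \cdots y_l,\, y'_1 \cdots y'_l)$.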
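The two example losses can be sketched in a few lines of Python. This is a minimal illustration, assuming label sequences are given as Python sequences of length $l \ge 1$ and that the edit distance is the unit-cost Levenshtein distance; the function names are hypothetical.

```python
from typing import Sequence


def hamming_loss(y: Sequence[str], y_prime: Sequence[str]) -> float:
    """Normalized Hamming loss: fraction of positions where the labels differ."""
    if len(y) != len(y_prime):
        raise ValueError("Hamming loss needs sequences of equal length")
    return sum(a != b for a, b in zip(y, y_prime)) / len(y)


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance (assumed unit cost for insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ai in enumerate(a, start=1):
        curr = [i]
        for j, bj in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ai
                            curr[j - 1] + 1,            # insert bj
                            prev[j - 1] + (ai != bj)))  # substitute or match
        prev = curr
    return prev[-1]


def edit_distance_loss(y: Sequence[str], y_prime: Sequence[str]) -> float:
    """Edit distance normalized by the length l of y."""
    return edit_distance(y, y_prime) / len(y)
```

For instance, `hamming_loss("DNVDN", "DNNDN")` compares five tags with one mismatch.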

3 Examples
- Pronunciation modeling.
- Part-of-speech tagging.
- Named-entity recognition.
- Context-free parsing.
- Dependency parsing.
- Machine translation.
- Image segmentation.

4 Examples: NLP Tasks
- Pronunciation:
    I    have      formulated                  a
    ay   hh ae v   f ow r m y ax l ey t ih d   ax
- POS tagging:
    The  thief  stole  a  car
    D    N      V      D  N
- Context-free parsing/dependency parsing:
    (S (NP (D The) (N thief)) (VP (V stole) (NP (D a) (N car))))
    [Figure: dependency tree over "root The thief stole a car"; arcs not recovered.]

5 Examples: Image Segmentation
[Figure: image segmentation example.]

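6 Predictors
- Family of scoring functions $H$ mapping from $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$.
- For any $h \in H$, prediction based on highest score: $\forall x \in \mathcal{X},\ \mathsf{h}(x) = \operatorname{argmax}_{y \in \mathcal{Y}} h(x, y)$.
- Decomposition as a sum modeled by factor graphs.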

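7 Factor Graph Examples
- Pairwise Markov network decomposition: $h(x, y) = h_{f_1}(x, y_1, y_2) + h_{f_2}(x, y_2, y_3)$.
  [Figure: chain factor graph, with $f_1$ connecting nodes 1 and 2 and $f_2$ connecting nodes 2 and 3.]
- Other decomposition: $h(x, y) = h_{f_1}(x, y_1, y_3) + h_{f_2}(x, y_1, y_2, y_3)$.
  [Figure: factor graph with $f_1$ connecting nodes 1 and 3 and $f_2$ connecting nodes 1, 2, and 3.]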

8 Factor Graphs
- $G = (V, F, E)$: factor graph.
- $\mathcal{N}(f)$: neighborhood of $f$.
- $\mathcal{Y}_f = \prod_{k \in \mathcal{N}(f)} \mathcal{Y}_k$: substructure set cross-product at $f$.
- Decomposition: $h(x, y) = \sum_{f \in F} h_f(x, y_f)$.
- More generally, example-dependent factor graph, $G_i = G(x_i, y_i) = (V_i, F_i, E_i)$.
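The decomposition of $h(x, y)$ over factors, together with the argmax prediction rule, can be sketched as follows. This is an illustrative toy, assuming 0-indexed positions and a brute-force search over all of $\mathcal{Y}$ (feasible only for very short outputs); the `Factor` representation and the example scores are hypothetical.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

# A factor is (neighborhood N(f), local scoring function h_f).
# h_f receives x and the tuple y_f of labels at the positions in N(f).
Factor = Tuple[Tuple[int, ...], Callable[[object, Tuple[str, ...]], float]]


def score(x, y: Sequence[str], factors: Sequence[Factor]) -> float:
    """h(x, y) = sum over factors f of h_f(x, y_f)."""
    return sum(h_f(x, tuple(y[k] for k in nbhd)) for nbhd, h_f in factors)


def predict(x, label_sets: Sequence[Sequence[str]], factors: Sequence[Factor]):
    """Brute-force argmax over Y = Y_1 x ... x Y_l (exponential in l)."""
    return max(product(*label_sets), key=lambda y: score(x, y, factors))


# Pairwise Markov chain on three positions: f1 on (0, 1), f2 on (1, 2).
chain = [
    ((0, 1), lambda x, yf: 1.0 if yf == ("D", "N") else 0.0),
    ((1, 2), lambda x, yf: 1.0 if yf == ("N", "V") else 0.0),
]
labels = [("D", "N", "V")] * 3
best = predict("the thief stole", labels, chain)
```

Here `best` is the single labeling that satisfies both pairwise factors. Practical systems replace the exhaustive search with dynamic programming or another inference routine suited to the factor graph.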

9 Linear Hypotheses
- Feature decomposition. Example: bigram decomposition,
  $\Psi(x, y) = \sum_{s=1}^{l} \psi(x, s, y_{s-1}, y_s)$.
    y: D    N    V    D    N
    x: his  cat  ate  the  fish
  At position 4, the local feature vector is $\psi(x, 4, y_3, y_4)$.
- Hypothesis decomposition:
  $h(x, y) = w \cdot \Psi(x, y) = \sum_{s=1}^{l} \underbrace{w \cdot \psi(x, s, y_{s-1}, y_s)}_{h_s(x, y_{s-1}, y_s)}$.
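Because the bigram hypothesis is a sum of local scores $h_s(x, y_{s-1}, y_s)$, the argmax can be found by dynamic programming rather than enumeration. The sketch below assumes a hypothetical sparse feature map with one emission and one transition feature per position, 0-indexed positions, and a start symbol `<s>`; none of these choices come from the slides.

```python
from typing import Dict, List, Sequence, Tuple

Feature = Tuple[str, ...]


def psi(x: Sequence[str], s: int, y_prev: str, y_cur: str) -> List[Feature]:
    """Hypothetical local features at position s: emission and tag bigram."""
    return [("emit", x[s], y_cur), ("trans", y_prev, y_cur)]


def local_score(w: Dict[Feature, float], x, s, y_prev, y_cur) -> float:
    """h_s(x, y_{s-1}, y_s) = w . psi(x, s, y_{s-1}, y_s) for sparse w."""
    return sum(w.get(f, 0.0) for f in psi(x, s, y_prev, y_cur))


def viterbi(w, x: Sequence[str], tags: Sequence[str], start: str = "<s>"):
    """Argmax over y of sum_s h_s(x, y_{s-1}, y_s); returns (score, path)."""
    best = {start: (0.0, [])}  # last tag -> (score, best path ending there)
    for s in range(len(x)):
        nxt = {}
        for t in tags:
            cand = [(sc + local_score(w, x, s, p, t), path + [t])
                    for p, (sc, path) in best.items()]
            nxt[t] = max(cand, key=lambda c: c[0])
        best = nxt
    return max(best.values(), key=lambda c: c[0])


x = "his cat ate the fish".split()
w = {("emit", "his", "D"): 1.0, ("emit", "cat", "N"): 1.0,
     ("emit", "ate", "V"): 1.0, ("emit", "the", "D"): 1.0,
     ("emit", "fish", "N"): 1.0, ("trans", "D", "N"): 0.5}
best_score, best_path = viterbi(w, x, ["D", "N", "V"])
```

With these toy weights the best path is the tagging shown on the slide, D N V D N.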

10 Structured Prediction Problem
- Training data: sample drawn i.i.d. from $\mathcal{X} \times \mathcal{Y}$ according to some distribution $D$,
  $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$.
- Problem: find hypothesis $h\colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ in $H$ with small expected loss:
  $R(h) = \mathbb{E}_{(x, y) \sim D}[L(\mathsf{h}(x), y)]$.
  - learning guarantees?
  - role of factor graph?
  - better algorithms?

11 This Talk
- Theory.
- Voted risk minimization (VRM).
- Algorithms.
- Experiments.

12 Theory

13 Previous Work
- Standard multi-class learning bounds:
  - number of classes is exponential!
- Structured prediction bounds:
  - covering number bounds: Hamming loss, linear hypotheses (Taskar et al., 2003).
  - PAC-Bayesian bounds (randomized algorithms) (McAllester, 2007).
- Can we derive learning guarantees for general hypothesis sets and general loss functions?

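14 Factor Graph Complexity
- Empirical factor graph complexity for hypothesis set $H$ and sample $S = (x_1, \ldots, x_m)$:
  $\widehat{\mathfrak{R}}^G_S(H) = \frac{1}{m} \mathbb{E}_{\epsilon}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} \sqrt{|F_i|}\, \epsilon_{i,f,y}\, h_f(x_i, y)\Big]$,
  where the $\epsilon_{i,f,y}$ are independent Rademacher variables. The triple sum inside the supremum measures the correlation of $h$ with random noise.
- Factor graph complexity:
  $\mathfrak{R}^G_m(H) = \mathbb{E}_{S \sim D^m}\big[\widehat{\mathfrak{R}}^G_S(H)\big]$.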

15 Margin
- Definition: the margin of $h$ at a labeled point $(x, y) \in \mathcal{X} \times \mathcal{Y}$ is
  $\rho_h(x, y) = \min_{y' \neq y} \big(h(x, y) - h(x, y')\big)$.
- error when $\rho_h(x, y) \le 0$.
- small margin interpreted as low confidence.
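The margin definition translates directly into code. The sketch below assumes a small output space that can be enumerated and a toy scoring function `h` that counts agreements with a fixed target; both are hypothetical stand-ins used only to illustrate the definition.

```python
from itertools import product


def margin(h, x, y, label_sets):
    """Margin of h at (x, y): min over y' != y of h(x, y) - h(x, y')."""
    y = tuple(y)
    return min(h(x, y) - h(x, yp) for yp in product(*label_sets) if yp != y)


# Toy scoring function: number of positions agreeing with a fixed target.
target = ("D", "N")


def h(x, y):
    return float(sum(a == b for a, b in zip(y, target)))


label_sets = [("D", "N")] * 2
m_correct = margin(h, None, ("D", "N"), label_sets)  # positive: confident and correct
m_wrong = margin(h, None, ("N", "D"), label_sets)    # non-positive: prediction error
```

A positive value means the labeled output outscores every competitor; a non-positive value corresponds to an error.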

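16 Empirical Margin Losses
For any $\rho > 0$,
  $\widehat{R}^{\mathrm{add}}_{S,\rho}(h) = \mathbb{E}_{(x, y) \sim S}\Big[\Phi_M\Big(\max_{y' \neq y} L(y', y) - \frac{h(x, y) - h(x, y')}{\rho}\Big)\Big]$,
  $\widehat{R}^{\mathrm{mult}}_{S,\rho}(h) = \mathbb{E}_{(x, y) \sim S}\Big[\Phi_M\Big(\max_{y' \neq y} L(y', y)\Big(1 - \frac{h(x, y) - h(x, y')}{\rho}\Big)\Big)\Big]$,
where $\Phi_M(u) = \min(M, \max(0, u))$ clips its argument to $[0, M]$ and $M$ is the maximum value of the loss $L$.
[Figure: plot of $\Phi_M(u)$ against $u$, equal to 0 for $u \le 0$, to $u$ on $[0, M]$, and to $M$ for $u \ge M$.]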
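Both empirical margin losses can be computed by brute force on a toy problem. The sketch below assumes an enumerable output space shared by all examples, a clipping function equal to $\min(M, \max(0, u))$, and hypothetical helper names; it is meant to make the two formulas concrete, not to be efficient.

```python
from itertools import product


def clip(u, M):
    """Clipping function: min(M, max(0, u))."""
    return min(M, max(0.0, u))


def margin_losses(h, loss, sample, label_sets, rho, M):
    """Return (additive, multiplicative) empirical margin losses on the sample."""
    add = mult = 0.0
    for x, y in sample:
        y = tuple(y)
        others = [yp for yp in product(*label_sets) if yp != y]
        # For each competitor y': (L(y', y), (h(x, y) - h(x, y')) / rho).
        pairs = [(loss(yp, y), (h(x, y) - h(x, yp)) / rho) for yp in others]
        add += clip(max(l - g for l, g in pairs), M)
        mult += clip(max(l * (1.0 - g) for l, g in pairs), M)
    return add / len(sample), mult / len(sample)


def hamming(y_prime, y):
    return sum(a != b for a, b in zip(y_prime, y)) / len(y)


def h(x, y):
    # Toy scorer: x carries the gold labels, score counts agreements.
    return float(sum(a == b for a, b in zip(y, x)))


gold = ("D", "N")
sample = [(gold, gold)]
label_sets = [("D", "N")] * 2
small_rho = margin_losses(h, hamming, sample, label_sets, rho=1.0, M=1.0)
large_rho = margin_losses(h, hamming, sample, label_sets, rho=4.0, M=1.0)
```

Increasing $\rho$ demands a larger score gap before the margin loss vanishes, which is why the same scorer incurs zero loss at $\rho = 1$ but not at $\rho = 4$.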

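17 Generalization Bounds
Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, each of the following holds for all $h \in H$:
  $R(h) \le \widehat{R}^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt{2}}{\rho}\, \mathfrak{R}^G_m(H) + M \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$,
  $R(h) \le \widehat{R}^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt{2}\, M}{\rho}\, \mathfrak{R}^G_m(H) + M \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$.
- tightest margin bounds for structured prediction.
- data-dependent.
- improve upon bound of (Taskar et al., 2003) by log terms (in the special case they study).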

18 Linear Hypotheses
Hypothesis set used by most convex structured prediction algorithms (StructSVM, M3N, CRF):
  $H_p = \big\{x \mapsto w \cdot \Psi(x, y) \colon w \in \mathbb{R}^N, \|w\|_p \le \Lambda_p\big\}$,
with $p \ge 1$ and $\Psi(x, y) = \sum_{f \in F} \Psi_f(x, y_f)$.

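19 Complexity Bounds
Bounds on factor graph complexity of linear hypothesis sets:
  $\widehat{\mathfrak{R}}^G_S(H_1) \le \frac{\Lambda_1 r_\infty}{m} \sqrt{s \log(2N)}$,
  $\widehat{\mathfrak{R}}^G_S(H_2) \le \frac{\Lambda_2 r_2}{m} \sqrt{\sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i|}$,
with
  $r_q = \max_{i,f,y} \|\Psi_f(x_i, y)\|_q$,
  $s = \max_{j \in [1, N]} \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i|\, 1_{\Psi_{f,j}(x_i, y) \neq 0}$.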
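The quantities in these bounds (the norms $r_\infty$ and $r_2$, the sparsity parameter $s$, and the triple sum of $|F_i|$) can be computed directly from the local feature vectors. The sketch below assumes a hypothetical data layout in which each example is a list of factors and each factor maps an element of $\mathcal{Y}_f$ to a dense length-$N$ feature list; `lam1` and `lam2` stand for the norm bounds $\Lambda_1$ and $\Lambda_2$.

```python
import math


def complexity_bounds(examples, N, lam1, lam2):
    """examples[i] is a list of factors; each factor maps y in Y_f to the
    length-N feature list Psi_f(x_i, y). Returns (s, bound for H_1, bound for H_2)."""
    m = len(examples)
    r_inf = r_2 = 0.0
    total = 0              # sum_i sum_f sum_y |F_i|
    per_feature = [0] * N  # same sum restricted to Psi_{f,j}(x_i, y) != 0
    for factors in examples:
        F_i = len(factors)
        for factor in factors:
            for vec in factor.values():
                r_inf = max(r_inf, max(abs(v) for v in vec))
                r_2 = max(r_2, math.sqrt(sum(v * v for v in vec)))
                total += F_i
                for j, v in enumerate(vec):
                    if v != 0:
                        per_feature[j] += F_i
    s = max(per_feature)
    bound_h1 = lam1 * r_inf * math.sqrt(s * math.log(2 * N)) / m
    bound_h2 = lam2 * r_2 * math.sqrt(total) / m
    return s, bound_h1, bound_h2


# One example with two factors, two substructures per factor, N = 2 features.
examples = [[{"a": [1.0, 0.0], "b": [0.0, 1.0]},
             {"a": [1.0, 0.0], "b": [0.0, 0.0]}]]
s, b1, b2 = complexity_bounds(examples, N=2, lam1=1.0, lam2=1.0)
```

In this toy case feature 0 fires in two of the four (factor, substructure) pairs, each weighted by $|F_i| = 2$, so $s = 4$, half of the unrestricted sum.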

20 Key Term
- Sparsity parameter:
  $s \le \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i| \le \sum_{i=1}^{m} |F_i|^2 d_i \le m \max_i |F_i|^2 d_i$,
  where $d_i = \max_{f \in F_i} |\mathcal{Y}_f|$.
- factor graph complexity in $O\Big(\sqrt{\log(N) \max_i |F_i|^2 d_i / m}\Big)$ for the $H_1$ hypothesis set.
- key term: average factor graph size.
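The step from the sparsity bound to the stated rate can be spelled out. The following short derivation uses only the quantities defined on slides 19 and 20 and treats $\Lambda_1$ and $r_\infty$ as constants:

```latex
\begin{align*}
\sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i|
  &= |F_i| \sum_{f \in F_i} |\mathcal{Y}_f|
   \le |F_i| \cdot |F_i| \, d_i = |F_i|^2 d_i, \\
\widehat{\mathfrak{R}}^G_S(H_1)
  &\le \frac{\Lambda_1 r_\infty}{m} \sqrt{s \log(2N)}
   \le \frac{\Lambda_1 r_\infty}{m} \sqrt{m \max_i |F_i|^2 d_i \, \log(2N)}
   = \Lambda_1 r_\infty \sqrt{\frac{\max_i |F_i|^2 d_i \, \log(2N)}{m}}.
\end{align*}
```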

21 NLP Applications
- Features:
  - $\Psi_{f,j}$ is often a binary function, non-zero for a single pair $(x, y) \in \mathcal{X} \times \mathcal{Y}_f$.
  - example: presence of n-gram (indexed by $j$) at position $f$ of the output with input sentence $x_i$.
  - complexity term only in $O\Big(\max_i |F_i| \sqrt{\log(N)/m}\Big)$.

22 Theory Takeaways
- Key generalization terms:
  - average size of factor graphs.
  - empirical margin loss.
- But, is learning with very complex hypothesis sets (factor graph complexity) possible?
  - richer families needed for difficult NLP tasks.
  - but generalization bound indicates risk of overfitting.
- Voted Risk Minimization (VRM) theory.

23 Voted Risk Minimization

24 Decomposition of H
- Decomposition in terms of sub-families.
[Figure: sub-families $H_1$, $H_2$, $H_3$, $H_4$, $H_5$ of $H$.]

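25 Ensemble Family
- Non-negative linear ensembles $\mathcal{F} = \operatorname{conv}\big(\bigcup_{k=1}^{p} H_k\big)$:
  $f = \sum_{t=1}^{T} \alpha_t h_t$ with $\alpha_t \ge 0$, $\sum_{t=1}^{T} \alpha_t \le 1$, $h_t \in H_{k_t}$.
[Figure: sub-families $H_1$, $H_2$, $H_3$, $H_4$, $H_5$ of $H$.]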

26 Ideas
- Use hypotheses drawn from $H_k$s with larger $k$s, but allocate more weight to hypotheses drawn from smaller $k$s.
  - how can we determine quantitatively the amounts of mixture weights apportioned to different families?
  - can we provide learning guarantees guiding these choices?
(Cortes, Mohri, and Syed, 2014)

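27 Learning Guarantee
Theorem: fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T} \alpha_t h_t \in \mathcal{F}$:
  $R(f) - \widehat{R}^{\mathrm{add}}_{S,\rho,1}(f) \le \frac{4\sqrt{2}}{\rho} \sum_{t=1}^{T} \alpha_t\, \mathfrak{R}^G_m(H_{k_t}) + \widetilde{O}\Big(M \sqrt{\frac{\log p}{\rho^2 m}}\Big)$,
  $R(f) - \widehat{R}^{\mathrm{mult}}_{S,\rho,1}(f) \le \frac{4\sqrt{2}\, M}{\rho} \sum_{t=1}^{T} \alpha_t\, \mathfrak{R}^G_m(H_{k_t}) + \widetilde{O}\Big(M \sqrt{\frac{\log p}{\rho^2 m}}\Big)$.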

28 Consequences
- Complexity term with explicit dependency on mixture weights.
  - quantitative guide for controlling weights assigned to more complex sub-families.
  - bound can be used to directly define an ensemble algorithm.

29 Algorithms

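30 Surrogate Loss Framework
Lemma: assume that $u\, 1_{v \le 0} \le \Phi_u(v)$ for any $u \in \mathbb{R}_+$ and $v \in \mathbb{R}$. Then, for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$,
  $L(\mathsf{h}(x), y) \le \max_{y' \neq y} \Phi_{L(y', y)}\big(h(x, y) - h(x, y')\big)$.
Proof: if $\mathsf{h}(x) = y$, then $L(\mathsf{h}(x), y) = 0$ and the result is trivial. Otherwise, $\mathsf{h}(x) \neq y$ and
  $L(\mathsf{h}(x), y) = L(\mathsf{h}(x), y)\, 1_{h(x, y) - \max_{y' \neq y} h(x, y') \le 0}$
  $\le \Phi_{L(\mathsf{h}(x), y)}\big(h(x, y) - \max_{y' \neq y} h(x, y')\big)$   ($\Phi_u(v)$ upper bound on $u\, 1_{v \le 0}$)
  $= \Phi_{L(\mathsf{h}(x), y)}\big(h(x, y) - h(x, \mathsf{h}(x))\big)$
  $\le \max_{y' \neq y} \Phi_{L(y', y)}\big(h(x, y) - h(x, y')\big)$.   ($\mathsf{h}(x) \neq y$)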

31 Application
Convex surrogate losses:
- $\Phi_u(v) = \max(0, u(1 - v))$: StructSVM (Tsochantaridis et al., 2005).
- $\Phi_u(v) = \max(0, u - v)$: M3N (Taskar et al., 2003).
- $\Phi_u(v) = \log(1 + e^{u - v})$: CRF (Lafferty et al., 2001).
- $\Phi_u(v) = u e^{-v}$: StructBoost (Cortes et al., 2016).
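Each of the four surrogates is an upper bound on the scaled zero-one function $v \mapsto u\, 1_{v \le 0}$ for $u \ge 0$, which is what the lemma of slide 30 requires. A minimal numerical sketch, with hypothetical function names:

```python
import math


def struct_svm(u, v):
    """max(0, u(1 - v))"""
    return max(0.0, u * (1.0 - v))


def m3n(u, v):
    """max(0, u - v)"""
    return max(0.0, u - v)


def crf(u, v):
    """log(1 + e^{u - v})"""
    return math.log1p(math.exp(u - v))


def struct_boost(u, v):
    """u e^{-v}"""
    return u * math.exp(-v)


def scaled_zero_one(u, v):
    """u * 1_{v <= 0}, the function every surrogate must dominate."""
    return u if v <= 0 else 0.0


surrogates = [struct_svm, m3n, crf, struct_boost]
grid = [(u, v) for u in (0.0, 0.5, 1.0, 2.0) for v in (-2.0, -1.0, 0.0, 0.5, 1.0, 3.0)]
dominates = all(phi(u, v) >= scaled_zero_one(u, v) for phi in surrogates for u, v in grid)
```

The grid check is only a sanity test on a few points, not a proof; the inequalities themselves follow from elementary bounds such as $\log(1 + e^{t}) \ge t$ and $e^{-v} \ge 1$ for $v \le 0$.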

32 Voted Cond. Random Field
- Hypothesis set: linear functions $h\colon (x, y) \mapsto w \cdot \Psi(x, y)$.
  - complex feature vector.
  - decomposition in blocks: $\Psi = (\Psi_1, \ldots, \Psi_p)^\top$.
- Upper bound:
  $\max_{y' \neq y} \log\Big(1 + e^{L(y, y') - w \cdot (\Psi(x, y) - \Psi(x, y'))}\Big) \le \log\Big(\sum_{y' \in \mathcal{Y}} e^{L(y, y') - w \cdot (\Psi(x, y) - \Psi(x, y'))}\Big)$.

33 Voted Cond. Random Field
- Optimization problem (VCRF):
  $\min_{w}\ \frac{1}{m} \sum_{i=1}^{m} \log\Big(\sum_{y \in \mathcal{Y}} e^{L(y, y_i) - w \cdot (\Psi(x_i, y_i) - \Psi(x_i, y))}\Big) + \sum_{k=1}^{p} (\lambda r_k + \beta) \|w_k\|_1$,
  with $r_k = r_\infty |F(k)| \sqrt{\log N}$.
- solution via stochastic gradient descent (SGD).
- relationship with L1-CRF.
- other regularization, e.g., L2-VCRF.
- efficient gradient computation for Markovian features.
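The VCRF objective can be evaluated by brute force on a toy problem, which helps check a faster implementation. The sketch below assumes an enumerable output space shared by all examples, a hypothetical `psi_blocks(x, y)` that returns the feature vector split into $p$ blocks, and block complexities `r` passed in as plain numbers rather than computed from $r_\infty |F(k)| \sqrt{\log N}$.

```python
import math
from itertools import product


def vcrf_objective(w_blocks, sample, label_sets, psi_blocks, loss, r, lam, beta):
    """Brute-force VCRF objective: averaged log-sum-exp term plus the
    block-weighted L1 penalty sum_k (lam * r_k + beta) * ||w_k||_1."""
    def dot(x, y):
        feats = psi_blocks(x, y)
        return sum(wj * fj for wk, fk in zip(w_blocks, feats)
                   for wj, fj in zip(wk, fk))

    data_term = 0.0
    for x, y_i in sample:
        y_i = tuple(y_i)
        gold = dot(x, y_i)
        exps = [loss(y, y_i) - (gold - dot(x, y)) for y in product(*label_sets)]
        top = max(exps)  # factor out the max for numerical stability
        data_term += top + math.log(sum(math.exp(e - top) for e in exps))
    data_term /= len(sample)
    penalty = sum((lam * r_k + beta) * sum(abs(wj) for wj in wk)
                  for r_k, wk in zip(r, w_blocks))
    return data_term + penalty


def zero_one(y, y_i):
    return 0.0 if y == y_i else 1.0


def psi_blocks(x, y):
    # Two hypothetical blocks: an indicator of the label D, and a constant.
    return [[1.0 if y == ("D",) else 0.0], [1.0]]


sample = [("the", ("D",))]
label_sets = [("D", "N")]
at_zero = vcrf_objective([[0.0], [0.0]], sample, label_sets, psi_blocks,
                         zero_one, r=[1.0, 3.0], lam=0.1, beta=0.01)
at_w = vcrf_objective([[2.0], [0.0]], sample, label_sets, psi_blocks,
                      zero_one, r=[1.0, 3.0], lam=0.1, beta=0.01)
```

The penalty charges more per unit of weight in blocks with larger $r_k$, which is how the objective discourages reliance on the more complex feature families.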

34 Experiments

35 Preliminary Experiments
- Part-of-speech tagging.
- Multiple data sets.

Dataset       Full name
Basque        Basque UD Treebank
Chinese       Chinese Treebank
Dutch         UD Dutch Treebank
English       UD English Web Treebank
Finnish       Finnish UD Treebank
Finnish-FTB   UD Finnish-FTB
Hindi         UD Hindi Treebank
Tamil         UD Tamil Treebank
Turkish       METU-Sabanci Turkish Treebank
Twitter       Tweebank

The Sentences, Tokens, Unique tokens, and Labels columns of the original table were lost in extraction.

36 Features - Example
    y: DET  NN   VBD  RB            JJ
    x: the  cat  was  surprisingly  agile
    s: 0    1    2    3             4
- $h_1(x) = 1_{x_2 = \text{was},\, x_3 = \text{surprisingly},\, x_4 = \text{agile}}(x)$
- $h_2(y) = 1_{y_2 = \text{VBD},\, y_3 = \text{RB}}(y)$
- $h_3(x) = 1_{\mathrm{suff}(x_3, 2) = \text{ly}}(x)$.
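The three example features are plain indicator functions and can be written as follows. The sketch assumes 0-indexed positions, which is the reading under which $x_2$ is "was" and $y_2$ is VBD; the helper `suffix` is a hypothetical name for the suffix function.

```python
def suffix(token: str, k: int) -> str:
    """Last k characters of a token."""
    return token[-k:]


def h1(x) -> int:
    """1 if x_2 = was, x_3 = surprisingly, x_4 = agile."""
    return int(list(x[2:5]) == ["was", "surprisingly", "agile"])


def h2(y) -> int:
    """1 if y_2 = VBD and y_3 = RB."""
    return int(list(y[2:4]) == ["VBD", "RB"])


def h3(x) -> int:
    """1 if the length-2 suffix of x_3 is 'ly'."""
    return int(len(x) > 3 and suffix(x[3], 2) == "ly")


x = "the cat was surprisingly agile".split()
y = "DET NN VBD RB JJ".split()
values = (h1(x), h2(y), h3(x))
```

All three features fire on the slide's example; changing a single word or tag inside the relevant window switches the corresponding indicator off.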

37 Features
- Feature families:
  - definition: for each choice of the window sizes $(k_1, k_2, k_3)$, sum of products of indicators over positions along the sequence.
  - complexity: $r(H_{k_1, k_2, k_3}) \le \sqrt{\frac{2\,(k_1 \log |V| + k_2 \log(\ldots) + k_3 \log(\ldots))}{m}}$ (the arguments of the last two logarithms are missing from the source).

38 Experiments
- Parameters $\lambda$ and $\beta$ determined via cross-validation.
- Comparison with L1-CRF.
- Two sets of results:
  - original data sets.
  - artificial noise added: for tokens corresponding to features that commonly appear in the dataset (at least five times), POS labels flipped with some probability (20% noise).
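One possible reading of the noise procedure is sketched below. It rests on several assumptions that the slide does not confirm: "commonly appearing" is measured by token frequency across the dataset, a flipped label is drawn uniformly from the other labels, and at least two labels exist. The function name and parameters are hypothetical.

```python
import random
from collections import Counter


def add_label_noise(sentences, labels, p=0.2, min_count=5, seed=0):
    """sentences: list of (tokens, tags). For tokens seen at least min_count
    times (assumed criterion), replace the tag with a different one w.p. p."""
    rng = random.Random(seed)
    counts = Counter(tok for tokens, _ in sentences for tok in tokens)
    noisy = []
    for tokens, tags in sentences:
        new_tags = []
        for tok, tag in zip(tokens, tags):
            if counts[tok] >= min_count and rng.random() < p:
                # Assumed flip rule: uniform over the remaining labels.
                tag = rng.choice([t for t in labels if t != tag])
            new_tags.append(tag)
        noisy.append((list(tokens), new_tags))
    return noisy


data = [(["the"] * 5 + ["cat"], ["D"] * 5 + ["N"])]
all_flipped = add_label_noise(data, ["D", "N", "V"], p=1.0)
unchanged = add_label_noise(data, ["D", "N", "V"], p=0.0)
```

Fixing the seed keeps the corrupted dataset reproducible across the cross-validation folds used to choose the regularization parameters.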

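39 Experimental Results
Error rates in %, mean ± standard deviation. "…" marks values lost in extraction.

Dataset       VCRF token   VCRF sentence   CRF token   CRF sentence
Basque        7.26 ± …     … ± …           … ± …       … ± 1.39
Chinese       7.38 ± …     … ± …           … ± …       … ± 0.49
Dutch         5.97 ± …     … ± …           … ± …       … ± 1.02
English       5.51 ± …     … ± …           … ± …       … ± 1.31
Finnish       7.48 ± …     … ± …           … ± …       … ± 1.36
Finnish-FTB   9.79 ± …     … ± …           … ± …       … ± 0.75
Hindi         4.84 ± …     … ± …           … ± …       … ± 0.75
Tamil         … ± …        … ± …           … ± …       … ± 1.54
Turkish       … ± …        … ± …           … ± …       … ± 1.01
Twitter       … ± …        … ± …           … ± …       … ± 1.37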

40 Average No. of Features
Columns: Dataset, VCRF, CRF, Ratio. The numeric values were lost in extraction.
Datasets: Basque, Chinese, Dutch, English, Finnish, Finnish-FTB, Hindi, Tamil, Turkish, Twitter.

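41 Experimental Results
Error rates in %, mean ± standard deviation. "…" marks values lost in extraction.

Dataset       VCRF token   VCRF sentence   CRF token   CRF sentence
Basque        9.13 ± …     … ± …           … ± …       … ± 1.08
Chinese       … ± …        … ± …           … ± …       … ± 0.01
Dutch         8.16 ± …     … ± …           … ± …       … ± 0.87
English       8.79 ± …     … ± …           … ± …       … ± 1.18
Finnish       9.38 ± …     … ± …           … ± …       … ± 0.93
Finnish-FTB   … ± …        … ± …           … ± …       … ± 1.19
Hindi         6.63 ± …     … ± …           … ± …       … ± 1.20
Tamil         … ± …        … ± …           … ± …       … ± 1.78
Turkish       … ± …        … ± …           … ± …       … ± 2.04
Twitter       … ± …        … ± …           … ± …       … ± 0.00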

42 Conclusion
- Structured prediction theory:
  - tightest margin guarantees for structured prediction.
  - general loss functions, data-dependent.
  - key notion of factor graph complexity.
- VCRF and StructBoost algorithms.
  - favorable preliminary experiments.
  - guarantees for complex hypothesis sets (VRM theory).
  - additionally, tightest margin bounds for standard classification.

On-Line Learning with Path Experts and Non-Additive Losses

On-Line Learning with Path Experts and Non-Additive Losses On-Line Learning with Path Experts and Non-Additive Losses Joint work with Corinna Cortes (Google Research) Vitaly Kuznetsov (Courant Institute) Manfred Warmuth (UC Santa Cruz) MEHRYAR MOHRI MOHRI@ COURANT

More information

ADANET: adaptive learning of neural networks

ADANET: adaptive learning of neural networks ADANET: adaptive learning of neural networks Joint work with Corinna Cortes (Google Research) Javier Gonzalo (Google Research) Vitaly Kuznetsov (Google Research) Scott Yang (Courant Institute) MEHRYAR

More information

Deep Boosting. Joint work with Corinna Cortes (Google Research) Umar Syed (Google Research) COURANT INSTITUTE & GOOGLE RESEARCH.

Deep Boosting. Joint work with Corinna Cortes (Google Research) Umar Syed (Google Research) COURANT INSTITUTE & GOOGLE RESEARCH. Deep Boosting Joint work with Corinna Cortes (Google Research) Umar Syed (Google Research) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Ensemble Methods in ML Combining several base classifiers

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:

More information

Foundations of Machine Learning

Foundations of Machine Learning Introduction to ML Mehryar Mohri Courant Institute and Google Research page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about

More information

Boosting Ensembles of Structured Prediction Rules

Boosting Ensembles of Structured Prediction Rules Boosting Ensembles of Structured Prediction Rules Corinna Cortes Google Research 76 Ninth Avenue New York, NY 10011 Vitaly Kuznetsov Courant Institute 251 Mercer Street New York, NY

More information

Structured Prediction

Structured Prediction Structured Prediction Classification Algorithms Classify objects x X into labels y Y First there was binary: Y = {0, 1} Then multiclass: Y = {1,...,6} The next generation: Structured Labels Structured

More information

Domain Adaptation for Regression

Domain Adaptation for Regression Domain Adaptation for Regression Corinna Cortes Google Research Mehryar Mohri Courant Institute and Google Motivation Applications: distinct training and test distributions.

More information

Learning Weighted Automata

Learning Weighted Automata Learning Weighted Automata Joint work with Borja Balle (Amazon Research) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Weighted Automata (WFAs) page 2 Motivation Weighted automata (WFAs): image

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima DEPARTMENT

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Structured Prediction

Structured Prediction Structured Prediction Ningshan Zhang Advanced Machine Learning, Spring 2016 Outline Ensemble Methods for Structured Prediction[1] On-line learning Boosting AGeneralizedKernelApproachtoStructuredOutputLearning[2]

More information

Sequential Supervised Learning

Sequential Supervised Learning Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima DEPARTMENT

More information

Multiclass and Introduction to Structured Prediction

Multiclass and Introduction to Structured Prediction Multiclass and Introduction to Structured Prediction David S. Rosenberg New York University March 27, 2018 David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 March 27, 2018 1 / 49 Contents

More information

A Support Vector Method for Multivariate Performance Measures

A Support Vector Method for Multivariate Performance Measures A Support Vector Method for Multivariate Performance Measures Thorsten Joachims Cornell University Department of Computer Science Thanks to Rich Caruana, Alexandru Niculescu-Mizil, Pierre Dupont, Jérôme

More information

Multiclass and Introduction to Structured Prediction

Multiclass and Introduction to Structured Prediction Multiclass and Introduction to Structured Prediction David S. Rosenberg Bloomberg ML EDU November 28, 2017 David S. Rosenberg (Bloomberg ML EDU) ML 101 November 28, 2017 1 / 48 Introduction David S. Rosenberg

More information

Structured Prediction Theory Based on Factor Graph Complexity

Structured Prediction Theory Based on Factor Graph Complexity Structured Prediction Theory Based on Factor Graph Coplexity Corinna Cortes Google Research New York, NY 00 corinna@googleco Mehryar Mohri Courant Institute and Google New York, NY 00 ohri@cisnyuedu Vitaly

More information

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015 Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative

More information

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

More information

Lecture 13: Structured Prediction

Lecture 13: Structured Prediction Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia Couse webpage: CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page

More information

Learning with Imperfect Data

Learning with Imperfect Data Mehryar Mohri Courant Institute and Google Joint work with: Yishay Mansour (Tel-Aviv & Google) and Afshin Rostamizadeh (Courant Institute). Standard Learning Assumptions IID assumption.

More information

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project

More information

Foundations of Machine Learning Multi-Class Classification. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Multi-Class Classification. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Multi-Class Classification Mehryar Mohri Courant Institute and Google Research Motivation Real-world problems often have multiple classes: text, speech,

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Global linear models Based on slides from Michael Collins Globally-normalized models Why do we decompose to a sequence of decisions? Can we directly estimate the probability

More information

with Local Dependencies

with Local Dependencies CS11-747 Neural Networks for NLP Structured Prediction with Local Dependencies Xuezhe Ma (Max) Site An Example Structured Prediction Problem: Sequence Labeling Sequence

More information

Multiclass Classification

Multiclass Classification Multiclass Classification David Rosenberg New York University March 7, 2017 David Rosenberg (New York University) DS-GA 1003 March 7, 2017 1 / 52 Introduction Introduction David Rosenberg (New York University)

More information

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 13 Mehryar Mohri Courant Institute and Google Research Multi-Class Classification Mehryar Mohri - Introduction to Machine Learning page 2 Motivation

More information

Notes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces

Notes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces Notes on the framework of Ando and Zhang (2005 Karl Stratos 1 Beyond learning good functions: learning good spaces 1.1 A single binary classification problem Let X denote the problem domain. Suppose we

More information

Introduction to Machine Learning Lecture 11. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 11. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 11 Mehryar Mohri Courant Institute and Google Research Boosting Mehryar Mohri - Introduction to Machine Learning page 2 Boosting Ideas Main idea:

More information

Linear Classifiers IV

Linear Classifiers IV Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers IV Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic

More information

Foundations of Machine Learning

Foundations of Machine Learning Maximum Entropy Models, Logistic Regression Mehryar Mohri Courant Institute and Google Research page 1 Motivation Probabilistic models: density estimation. classification. page 2 This

More information

Machine Learning for Structured Prediction

Machine Learning for Structured Prediction Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for

More information


INTRODUCTION TO DATA SCIENCE INTRODUCTION TO DATA SCIENCE JOHN P DICKERSON Lecture #13 3/9/2017 CMSC320 Tuesdays & Thursdays 3:30pm 4:45pm ANNOUNCEMENTS Mini-Project #1 is due Saturday night (3/11): Seems like people are able to do

More information

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group The POS Tagging Problem 2 England NNP s POS fencers

More information

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013 More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013 Summary Parts of speech tagging HMMs: Unsupervised parameter estimation Forward Backward algorithm Bayesian variants Discriminative

More information

Conditional Random Fields for Sequential Supervised Learning

Conditional Random Fields for Sequential Supervised Learning Conditional Random Fields for Sequential Supervised Learning Thomas G. Dietterich Adam Ashenfelter Department of Computer Science Oregon State University Corvallis, Oregon 97331

More information

Structured Prediction

Structured Prediction Machine Learning Fall 2017 (structured perceptron, HMM, structured SVM) Professor Liang Huang (Chap. 17 of CIML) x x the man bit the dog x the man bit the dog x DT NN VBD DT NN S =+1 =-1 the man bit the

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson ( Yishay Mansour ( Teaching Assistance: Regev Schweiger

More information

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations

More information

Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron)

Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron) Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron) Intro to NLP, CS585, Fall 2014 Brendan O Connor ( 1 Models for

More information

Learning to translate with neural networks. Michael Auli

Learning to translate with neural networks. Michael Auli Learning to translate with neural networks Michael Auli 1 Neural networks for text processing Similar words near each other France Spain dog cat Neural networks for text processing Similar words near each

More information

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018 From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: Naïve Bayes

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

2.2 Structured Prediction

2.2 Structured Prediction The hinge loss (also called the margin loss), which is optimized by the SVM, is a ramp function that has slope 1 when yf(x) < 1 and is zero otherwise. Two other loss functions squared loss and exponential

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA Ruth Urner College

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

More information

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting Estimating the accuracy of a hypothesis Setting Assume a binary classification setting Assume input/output pairs (x, y) are sampled from an unknown probability distribution D = p(x, y) Train a binary classifier

More information

Machine Learning. Ensemble Methods. Manfred Huber

Machine Learning. Ensemble Methods. Manfred Huber Machine Learning Ensemble Methods Manfred Huber 2015 1 Bias, Variance, Noise Classification errors have different sources Choice of hypothesis space and algorithm Training set Noise in the data The expected

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

Introduction to Machine Learning Lecture 14. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 14. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 14 Mehryar Mohri Courant Institute and Google Research Density Estimation Maxent Models 2 Entropy Definition: the entropy of a random variable

More information

Applied Natural Language Processing

Applied Natural Language Processing Applied Natural Language Processing Info 256 Lecture 7: Testing (Feb 12, 2019) David Bamman, UC Berkeley Significance in NLP You develop a new method for text classification; is it better than what comes

More information

Foundations of Machine Learning Lecture 9. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Lecture 9. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Lecture 9 Mehryar Mohri Courant Institute and Google Research Multi-Class Classification page 2 Motivation Real-world problems often have multiple classes:

More information

Conditional Random Field

Conditional Random Field Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions

More information

Polyhedral Outer Approximations with Application to Natural Language Parsing

Polyhedral Outer Approximations with Application to Natural Language Parsing Polyhedral Outer Approximations with Application to Natural Language Parsing André F. T. Martins 1,2 Noah A. Smith 1 Eric P. Xing 1 1 Language Technologies Institute School of Computer Science Carnegie

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models David Sontag New York University Lecture 12, April 23, 2013 David Sontag NYU) Graphical Models Lecture 12, April 23, 2013 1 / 24 What notion of best should learning be optimizing?

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Probabilistic Models for Sequence Labeling

Probabilistic Models for Sequence Labeling Probabilistic Models for Sequence Labeling Besnik Fetahu June 9, 2011 Besnik Fetahu () Probabilistic Models for Sequence Labeling June 9, 2011 1 / 26 Background & Motivation Problem introduction Generative

More information

Online Learning for Time Series Prediction

Online Learning for Time Series Prediction Online Learning for Time Series Prediction Joint work with Vitaly Kuznetsov (Google Research) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Motivation Time series prediction: stock values.

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Info 159/259 Lecture 12: Features and hypothesis tests (Oct 3, 2017) David Bamman, UC Berkeley Announcements No office hours for DB this Friday (email if you d like to chat)

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

Algorithms for Predicting Structured Data

Algorithms for Predicting Structured Data 1 / 70 Algorithms for Predicting Structured Data Thomas Gärtner / Shankar Vembu Fraunhofer IAIS / UIUC ECML PKDD 2010 Structured Prediction 2 / 70 Predicting multiple outputs with complex internal structure

More information

Graphical models for part of speech tagging

Graphical models for part of speech tagging Indian Institute of Technology, Bombay and Research Division, India Research Lab Graphical models for part of speech tagging Different Models for POS tagging HMM Maximum Entropy Markov Models Conditional

More information

Machine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang Example: image classification indoor Indoor outdoor Example: image classification (multiclass)

More information

