Structured Prediction Theory and Algorithms
1 Structured Prediction Theory and Algorithms. Joint work with Corinna Cortes (Google Research), Vitaly Kuznetsov (Google Research), and Scott Yang (Courant Institute). Mehryar Mohri, Courant Institute & Google Research.
2 Structured Prediction. Structured output: $Y = Y_1 \times \cdots \times Y_l$. Loss function: $L\colon Y \times Y \to \mathbb{R}_+$, decomposable. Example: Hamming loss, $L(y, y') = \frac{1}{l}\sum_{k=1}^{l} 1_{y_k \neq y'_k}$. Example: edit-distance loss, $L(y, y') = \frac{1}{l}\, d_{\mathrm{edit}}(y_1 \cdots y_l,\, y'_1 \cdots y'_l)$.
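As a concrete illustration, here is a minimal sketch of these two example losses (the function names are ours, not from the talk):

```python
def hamming_loss(y, y_prime):
    """Normalized Hamming loss: fraction of positions where labels differ."""
    assert len(y) == len(y_prime)
    return sum(a != b for a, b in zip(y, y_prime)) / len(y)

def edit_distance_loss(y, y_prime):
    """Edit distance (standard dynamic program) normalized by the length of y."""
    m, n = len(y), len(y_prime)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if y[i - 1] == y_prime[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n] / m

# Two label sequences of length 5 differing in one position:
print(hamming_loss("DNVDN", "DNVNN"))        # 0.2
print(edit_distance_loss("DNVDN", "DNVNN"))  # 0.2
```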
3 Examples. Pronunciation modeling. Part-of-speech tagging. Named-entity recognition. Context-free parsing. Dependency parsing. Machine translation. Image segmentation.
4 Examples: NLP Tasks. Pronunciation: "I have formulated a" → "ay hh ae v f ow r m y ax l ey t ih d ax". POS tagging: "The thief stole a car" → "D N V D N". Context-free parsing/dependency parsing: [figures: constituency parse and dependency graph of "The thief stole a car"].
5 Examples: Image Segmentation. [image examples]
6 Predictors. Family of scoring functions $H$ mapping from $X \times Y$ to $\mathbb{R}$. For any $h \in H$, prediction based on the highest score: $\forall x \in X,\ \mathsf{h}(x) = \operatorname{argmax}_{y \in Y} h(x, y)$. Decomposition as a sum modeled by factor graphs.
7 Factor Graph Examples. Pairwise Markov network decomposition [chain graph: 1 — $f_1$ — 2 — $f_2$ — 3]: $h(x, y) = h_{f_1}(x, y_1, y_2) + h_{f_2}(x, y_2, y_3)$. Other decomposition [graph with $f_1$ joining nodes 1, 3 and $f_2$ joining nodes 1, 2, 3]: $h(x, y) = h_{f_1}(x, y_1, y_3) + h_{f_2}(x, y_1, y_2, y_3)$.
8 Factor Graphs. $G = (V, F, E)$: factor graph. $N(f)$: neighborhood of $f$. $Y_f = \prod_{k \in N(f)} Y_k$: substructure set cross-product at $f$. Decomposition: $h(x, y) = \sum_{f \in F} h_f(x, y_f)$. More generally, example-dependent factor graph, $G_i = G(x_i, y_i) = (V_i, F_i, E_i)$.
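A minimal sketch of factor-graph scoring under these definitions (all names are illustrative): each factor $f$ scores the restriction $y_f$ of the output to its neighborhood $N(f)$, and $h(x, y)$ sums the factor scores.

```python
def factor_graph_score(x, y, factors):
    """factors: list of (neighborhood, score_fn) pairs, where neighborhood
    is a tuple of label positions N(f) and score_fn(x, y_f) returns a float."""
    total = 0.0
    for neighborhood, score_fn in factors:
        y_f = tuple(y[k] for k in neighborhood)  # restriction of y to N(f)
        total += score_fn(x, y_f)
    return total

# Pairwise Markov chain over 3 positions: factors on (0, 1) and (1, 2).
factors = [((0, 1), lambda x, yf: 1.0 if yf == ("D", "N") else 0.0),
           ((1, 2), lambda x, yf: 1.0 if yf == ("N", "V") else 0.0)]
print(factor_graph_score("the thief stole", ("D", "N", "V"), factors))  # 2.0
```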
9 Linear Hypotheses. Feature decomposition. Example: bigram decomposition, $\Phi(x, y) = \sum_{s=1}^{l} \Phi(x, s, y_{s-1}, y_s)$. [Illustration: y: D N V D N; x: his cat ate the fish; at position $s = 4$, the feature is $\Phi(x, 4, y_3, y_4)$.] Hypothesis decomposition: $h(x, y) = w \cdot \Phi(x, y) = \sum_{s=1}^{l} \underbrace{w \cdot \Phi(x, s, y_{s-1}, y_s)}_{h_s(x,\, y_{s-1},\, y_s)}$.
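The bigram decomposition can be made concrete as follows; this is a hypothetical toy feature map (an indicator on label bigrams only), not the talk's feature set:

```python
import numpy as np

LABELS = ["D", "N", "V"]
IDX = {(a, b): i for i, (a, b) in
       enumerate((a, b) for a in LABELS + ["<s>"] for b in LABELS)}

def phi(x, s, y_prev, y_cur):
    """Per-position feature vector Phi(x, s, y_{s-1}, y_s): bigram indicator."""
    v = np.zeros(len(IDX))
    v[IDX[(y_prev, y_cur)]] = 1.0
    return v

def score(w, x, y):
    """h(x, y) = w . Phi(x, y) = sum_s w . phi(x, s, y_{s-1}, y_s)."""
    return sum(w @ phi(x, s, y[s - 1] if s > 0 else "<s>", y[s])
               for s in range(len(y)))

w = np.random.default_rng(0).normal(size=len(IDX))
print(score(w, "his cat ate the fish".split(), ["D", "N", "V", "D", "N"]))
```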
10 Structured Prediction Problem. Training data: sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$ drawn i.i.d. according to some distribution $D$ over $X \times Y$. Problem: find hypothesis $h \colon X \times Y \to \mathbb{R}$ in $H$ with small expected loss: $R(h) = \mathbb{E}_{(x,y) \sim D}[L(\mathsf{h}(x), y)]$. Questions: learning guarantees? role of the factor graph? better algorithms?
11 This Talk. Theory. Voted risk minimization (VRM). Algorithms. Experiments.
12 Theory
13 Previous Work. Standard multi-class learning bounds: the number of classes is exponential! Structured prediction bounds: covering-number bounds for the Hamming loss and linear hypotheses (Taskar et al., 2003); PAC-Bayesian bounds for randomized algorithms (McAllester, 2007). Can we derive learning guarantees for general hypothesis sets and general loss functions?
14 Factor Graph Complexity. Empirical factor graph complexity for hypothesis set $H$ and sample $S = (x_1, \ldots, x_m)$:
$\hat{\mathfrak{R}}^G_S(H) = \frac{1}{m}\, \mathbb{E}_{\epsilon}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in Y_f} \sqrt{|F_i|}\; \epsilon_{i,f,y}\, h_f(x_i, y)\Big]$
(the supremum of the correlation with the random noise $\epsilon_{i,f,y}$).
Factor graph complexity: $\mathfrak{R}^G_m(H) = \mathbb{E}_{S \sim D^m}\big[\hat{\mathfrak{R}}^G_S(H)\big]$.
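Since $\hat{\mathfrak{R}}^G_S(H)$ is an expectation over Rademacher variables, it can be approximated by Monte Carlo for a finite pool of hypotheses; a sketch under that assumption (the matrix layout is ours):

```python
import numpy as np

def empirical_fg_complexity(score_matrix, m, n_draws=2000, seed=0):
    """score_matrix: (n_hypotheses, n_terms) array whose entry [h, t] is
    sqrt(|F_i|) * h_f(x_i, y) for the t-th triple (i, f, y); m: sample size.
    Returns a Monte Carlo estimate of (1/m) E_eps[sup_h sum_t eps_t * score]."""
    rng = np.random.default_rng(seed)
    n_terms = score_matrix.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n_terms))  # Rademacher draws
    sups = (eps @ score_matrix.T).max(axis=1)               # sup over hypotheses
    return float(sups.mean()) / m
```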
15 Margin. Definition: the margin of $h$ at a labeled point $(x, y) \in X \times Y$ is $\rho_h(x, y) = \min_{y' \neq y}\, h(x, y) - h(x, y')$. Error when $\rho_h(x, y) \le 0$. Small margin interpreted as low confidence.
16 Empirical Margin Losses. For any $\rho > 0$,
$\hat{R}^{\mathrm{add}}_{S,\rho}(h) = \mathbb{E}_{(x,y) \sim S}\Big[\Phi_M\Big(\max_{y' \neq y}\, L(y', y) - \frac{h(x, y) - h(x, y')}{\rho}\Big)\Big]$,
$\hat{R}^{\mathrm{mult}}_{S,\rho}(h) = \mathbb{E}_{(x,y) \sim S}\Big[\Phi_M\Big(\max_{y' \neq y}\, L(y', y)\Big(1 - \frac{h(x, y) - h(x, y')}{\rho}\Big)\Big)\Big]$,
where $\Phi_M(u) = \min(M, \max(0, u))$ clips its argument to $[0, M]$ [plot of $\Phi_M$].
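A short sketch of both losses for a single labeled point, with $\Phi_M$ clipping to $[0, M]$ as in the plot (names are ours):

```python
def phi_m(u, M):
    """Clip to [0, M], as in the plot of Phi_M above."""
    return min(M, max(0.0, u))

def margin_losses(h, L, x, y, outputs, rho, M):
    """h(x, y) -> score; L(y', y) -> loss; outputs: all candidate outputs."""
    add = max(L(yp, y) - (h(x, y) - h(x, yp)) / rho
              for yp in outputs if yp != y)
    mult = max(L(yp, y) * (1.0 - (h(x, y) - h(x, yp)) / rho)
               for yp in outputs if yp != y)
    return phi_m(add, M), phi_m(mult, M)
```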
17 Generalization Bounds. Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, each of the following holds for all $h \in H$:
$R(h) \le \hat{R}^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt{2}}{\rho}\, \mathfrak{R}^G_m(H) + M\sqrt{\frac{\log \frac{1}{\delta}}{2m}}$,
$R(h) \le \hat{R}^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt{2}\, M}{\rho}\, \mathfrak{R}^G_m(H) + M\sqrt{\frac{\log \frac{1}{\delta}}{2m}}$.
Tightest margin bounds for structured prediction. Data-dependent. Improve upon the bound of (Taskar et al., 2003) by log terms (in the special case they study).
18 Linear Hypotheses. Hypothesis set used by most convex structured prediction algorithms (StructSVM, M3N, CRF):
$H_p = \big\{x \mapsto w \cdot \Phi(x, y) \colon w \in \mathbb{R}^N,\ \|w\|_p \le \Lambda_p\big\}$,
with $p \ge 1$ and $\Phi(x, y) = \sum_{f \in F} \Phi_f(x, y_f)$.
19 Complexity Bounds. Bounds on the factor graph complexity of linear hypothesis sets:
$\hat{\mathfrak{R}}^G_S(H_1) \le \frac{\Lambda_1 r_\infty \sqrt{s \log(2N)}}{m}$,
$\hat{\mathfrak{R}}^G_S(H_2) \le \frac{\Lambda_2 r_2 \sqrt{\sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in Y_f} |F_i|}}{m}$,
with $r_q = \max_{i,f,y} \|\Phi_f(x_i, y)\|_q$ and $s = \max_{j \in [1, N]} \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in Y_f} |F_i|\, 1_{\Phi_{f,j}(x_i, y) \neq 0}$.
20 Key Term. Sparsity parameter:
$s \le \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in Y_f} |F_i| \le \sum_{i=1}^{m} |F_i|^2 d_i \le m \max_i |F_i|^2 d_i$,
where $d_i = \max_{f \in F_i} |Y_f|$. Factor graph complexity in $O\big(\sqrt{\log(N) \max_i |F_i|^2 d_i / m}\big)$ for the $H_1$ hypothesis set. Key term: average factor graph size.
21 NLP Applications. Features: $\Phi_{f,j}$ is often a binary function, non-zero for a single pair $(x, y) \in X \times Y_f$. Example: presence of an $n$-gram (indexed by $j$) at position $f$ of the output, with input sentence $x_i$. Complexity term only in $O\big(\max_i |F_i| \sqrt{\log(N)/m}\big)$.
22 Theory Takeaways. Key generalization terms: average size of the factor graphs; empirical margin loss. But is learning with very complex hypothesis sets (high factor graph complexity) possible? Richer families are needed for difficult NLP tasks, but the generalization bound indicates a risk of overfitting. This motivates Voted Risk Minimization (VRM) theory.
23 Voted Risk Minimization
24 Decomposition of H. Decomposition in terms of sub-families $H_1, H_2, H_3, H_4, H_5$ [figure: overlapping sub-families].
25 Ensemble Family. Non-negative linear ensembles $F = \mathrm{conv}\big(\cup_{k=1}^{p} H_k\big)$: $f = \sum_{t=1}^{T} \alpha_t h_t$, with $\alpha_t \ge 0$, $\sum_{t=1}^{T} \alpha_t \le 1$, $h_t \in H_{k_t}$. [figure: overlapping sub-families]
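In code, such an ensemble is just a constrained convex combination of base scoring functions; a toy sketch:

```python
def ensemble_score(alphas, hs, x, y):
    """f(x, y) = sum_t alpha_t * h_t(x, y), with alpha_t >= 0 and sum <= 1."""
    assert all(a >= 0 for a in alphas) and sum(alphas) <= 1 + 1e-9
    return sum(a * h(x, y) for a, h in zip(alphas, hs))
```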
26 Ideas. Use hypotheses drawn from $H_k$s with larger $k$s, but allocate more weight to hypotheses drawn from smaller $k$s. How can we determine quantitatively the amounts of mixture weights apportioned to the different families? (Cortes, MM, and Syed, 2014) Can we provide learning guarantees guiding these choices?
27 Learning Guarantee. Theorem: fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T} \alpha_t h_t \in F$:
$R(f) \le \hat{R}^{\mathrm{add}}_{S,\rho,1}(f) + \frac{4\sqrt{2}}{\rho} \sum_{t=1}^{T} \alpha_t\, \mathfrak{R}^G_m(H_{k_t}) + \widetilde{O}\Big(M\sqrt{\frac{\log p}{\rho^2 m}}\Big)$,
$R(f) \le \hat{R}^{\mathrm{mult}}_{S,\rho,1}(f) + \frac{4\sqrt{2}\, M}{\rho} \sum_{t=1}^{T} \alpha_t\, \mathfrak{R}^G_m(H_{k_t}) + \widetilde{O}\Big(M\sqrt{\frac{\log p}{\rho^2 m}}\Big)$.
28 Consequences. Complexity term with an explicit dependency on the mixture weights: a quantitative guide for controlling the weights assigned to more complex sub-families. The bound can be used to directly define an ensemble algorithm.
29 Algorithms
30 Surrogate Loss Framework. Lemma: assume that $u\, 1_{v \le 0} \le \Phi_u(v)$ for any $u \in \mathbb{R}_+$ and $v \in \mathbb{R}$. Then, for any $(x, y) \in X \times Y$,
$L(\mathsf{h}(x), y) \le \max_{y' \neq y} \Phi_{L(y',y)}\big(h(x, y) - h(x, y')\big)$.
Proof: if $\mathsf{h}(x) = y$, then $L(\mathsf{h}(x), y) = 0$ and the result is trivial. Otherwise, $\mathsf{h}(x) \neq y$ and
$L(\mathsf{h}(x), y) = L(\mathsf{h}(x), y)\, 1_{h(x,y) - \max_{y' \neq y} h(x,y') \le 0}$
$\le \Phi_{L(\mathsf{h}(x),y)}\big(h(x, y) - \max_{y' \neq y} h(x, y')\big)$  ($\Phi_u(v)$ upper bound on $u\, 1_{v \le 0}$)
$= \Phi_{L(\mathsf{h}(x),y)}\big(h(x, y) - h(x, \mathsf{h}(x))\big)$  ($\mathsf{h}(x) \neq y$)
$\le \max_{y' \neq y} \Phi_{L(y',y)}\big(h(x, y) - h(x, y')\big)$.
31 Application. Convex surrogate losses:
$\Phi_u(v) = \max(0, u(1 - v))$: StructSVM (Tsochantaridis et al., 2005).
$\Phi_u(v) = \max(0, u - v)$: M3N (Taskar et al., 2003).
$\Phi_u(v) = \log(1 + e^{u - v})$: CRF (Lafferty et al., 2001).
$\Phi_u(v) = u e^{-v}$: StructBoost (Cortes et al., 2016).
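All four surrogates are easy to write down directly; the sanity check at the end verifies the lemma's premise $u\, 1_{v \le 0} \le \Phi_u(v)$ at $v = 0$:

```python
import math

def struct_svm(u, v):   # Tsochantaridis et al., 2005
    return max(0.0, u * (1.0 - v))

def m3n(u, v):          # Taskar et al., 2003
    return max(0.0, u - v)

def crf(u, v):          # Lafferty et al., 2001
    return math.log1p(math.exp(u - v))

def struct_boost(u, v): # Cortes et al., 2016
    return u * math.exp(-v)

for surrogate in (struct_svm, m3n, crf, struct_boost):
    assert surrogate(1.0, 0.0) >= 1.0  # u * 1_{v<=0} = 1 at u=1, v=0
```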
32 Voted Cond. Random Field. Hypothesis set: linear functions $h \colon (x, y) \mapsto w \cdot \Phi(x, y)$, with a complex feature vector $\Phi$, decomposed in blocks: $\Phi = (\Phi_1, \ldots, \Phi_p)$. Upper bound:
$\max_{y' \neq y} \log\big(1 + e^{L(y, y')}\, e^{-w \cdot (\Phi(x,y) - \Phi(x,y'))}\big) \le \log \sum_{y' \in Y} e^{L(y, y') - w \cdot (\Phi(x,y) - \Phi(x,y'))}$.
33 Voted Cond. Random Field. Optimization problem (VCRF):
$\min_w \frac{1}{m} \sum_{i=1}^{m} \log \sum_{y \in Y} e^{L(y, y_i) - w \cdot (\Phi(x_i, y_i) - \Phi(x_i, y))} + \sum_{k=1}^{p} (\lambda r_k + \beta) \|w_k\|_1$,
with $r_k = r_\infty |F^{(k)}| \sqrt{\log N}$. Solution via stochastic gradient descent (SGD). Relationship with L1-CRF. Other regularizations, e.g., L2-VCRF. Efficient gradient computation for Markovian features.
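A hedged sketch of the VCRF objective and one SGD step, for small output sets where the sum over $Y$ can be enumerated; this is our illustration, not the authors' implementation (which uses efficient gradients for Markovian features):

```python
import numpy as np

def vcrf_example_loss_grad(w, x, y, Phi, L, Y):
    """log sum_y' exp(L(y', y) - w.(Phi(x, y) - Phi(x, y'))) and its gradient."""
    deltas = np.array([Phi(x, y) - Phi(x, yp) for yp in Y])    # (|Y|, N)
    scores = np.array([L(yp, y) for yp in Y]) - deltas @ w
    zmax = scores.max()                                        # stable log-sum-exp
    logz = zmax + np.log(np.exp(scores - zmax).sum())
    p = np.exp(scores - logz)                                  # softmax weights
    grad = -(p[:, None] * deltas).sum(axis=0)
    return logz, grad

def sgd_step(w, x, y, Phi, L, Y, blocks, r, lam, beta, eta):
    """blocks: list of index arrays for w_1..w_p; r[k]: block complexity r_k."""
    _, g = vcrf_example_loss_grad(w, x, y, Phi, L, Y)
    for k, idx in enumerate(blocks):
        g[idx] += (lam * r[k] + beta) * np.sign(w[idx])        # L1 subgradient
    return w - eta * g
```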
34 Experiments
35 Preliminary Experiments. Part-of-speech tagging on multiple data sets: Basque (Basque UD Treebank), Chinese (Chinese Treebank), Dutch (UD Dutch Treebank), English (UD English Web Treebank), Finnish (Finnish UD Treebank), Finnish-FTB (UD Finnish-FTB), Hindi (UD Hindi Treebank), Tamil (UD Tamil Treebank), Turkish (METU-Sabanci Turkish Treebank), Twitter (Tweebank). [Table of sentence, token, unique-token, and label counts; values lost in transcription.]
36 Features - Example. y: DET NN VBD RB JJ; x: the cat was surprisingly agile (positions 0-indexed).
$h_1(x) = 1_{x_2 = \text{was},\, x_3 = \text{surprisingly},\, x_4 = \text{agile}}(x)$,
$h_2(y) = 1_{y_2 = \text{VBD},\, y_3 = \text{RB}}(y)$,
$h_3(x) = 1_{\mathrm{suff}(x_3, 2) = \text{ly}}(x)$.
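Rendered as code (0-indexed positions, as above):

```python
x = ["the", "cat", "was", "surprisingly", "agile"]
y = ["DET", "NN", "VBD", "RB", "JJ"]

h1 = float(x[2] == "was" and x[3] == "surprisingly" and x[4] == "agile")
h2 = float(y[2] == "VBD" and y[3] == "RB")
h3 = float(x[3].endswith("ly"))  # suffix of length 2 of x_3 equals "ly"

print(h1, h2, h3)  # 1.0 1.0 1.0
```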
37 Features. Feature families: for each choice of the window sizes $(k_1, k_2, k_3)$, a sum of products of indicators over positions along the sequence. Complexity: $r(H_{k_1,k_2,k_3}) \le r\sqrt{2(k_1 \log V + k_2 \log m + k_3 \log \ldots)}$ [bound truncated in the transcription].
38 Experiments. Parameters $\lambda$ and $\beta$ determined via cross-validation. Comparison with L1-CRF. Two sets of results: the original data sets, and versions with artificial noise added: for tokens corresponding to features that appear commonly in the dataset (at least five times), POS labels are flipped with some probability (20% noise).
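A hypothetical sketch of the noise protocol just described (the parameter names and the token-frequency reading are ours): flip the POS tag of any token occurring at least five times, with probability 0.2.

```python
import random
from collections import Counter

def add_label_noise(sentences, tag_seqs, tagset, min_count=5, p=0.2, seed=0):
    """Return noisy copies of tag_seqs; frequent tokens get flipped tags."""
    rng = random.Random(seed)
    counts = Counter(tok for sent in sentences for tok in sent)
    noisy = []
    for sent, tags in zip(sentences, tag_seqs):
        new_tags = []
        for tok, tag in zip(sent, tags):
            if counts[tok] >= min_count and rng.random() < p:
                new_tags.append(rng.choice([t for t in tagset if t != tag]))
            else:
                new_tags.append(tag)
        noisy.append(new_tags)
    return noisy
```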
39 Experimental Results. [Table: token and sentence error rates (%), mean ± standard deviation, for VCRF vs. CRF on each dataset; most entries lost in transcription. Surviving VCRF token errors: Basque 7.26, Chinese 7.38, Dutch 5.97, English 5.51, Finnish 7.48, Finnish-FTB 9.79, Hindi 4.84.]
40 Average No. of Features. [Table: average number of features for VCRF vs. CRF per dataset, with their ratio; values lost in transcription.]
41 Experimental Results. [Table: token and sentence error rates (%), mean ± standard deviation, for VCRF vs. CRF on each dataset; most entries lost in transcription. Surviving VCRF token errors: Basque 9.13, Dutch 8.16, English 8.79, Finnish 9.38, Hindi 6.63.]
42 Conclusion. Structured prediction theory: tightest margin guarantees for structured prediction; general loss functions, data-dependent; key notion of factor graph complexity. VCRF and StructBoost algorithms, with favorable preliminary experiments. Guarantees for complex hypothesis sets (VRM theory). Additionally, tightest margin bounds for standard classification.