Structured Prediction: A Large Margin Approach. Ben Taskar University of Pennsylvania

1 Structured Prediction: A Large Margin Approach Ben Taskar University of Pennsylvania

2 Structured Prediction: Prediction of complex outputs. Structured outputs: multivariate, correlated, constrained. Novel, general way to solve many learning problems.

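3 Supervised Learning: Learn a predictor $h: \mathcal{X} \to \mathcal{Y}$ from labeled examples $(x, y)$. Regression: $\mathcal{Y} = \mathbb{R}$. Binary classification: $\mathcal{Y} = \{-1, +1\}$. Multiclass classification: $\mathcal{Y} = \{1, \dots, K\}$. Structured prediction: $\mathcal{Y}$ is a combinatorial space of structured objects.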

4 Handwriting Recognition: x = image of a handwritten word, y = the word it spells (here, "structured").

5 Object Segmentation x y

6 Articulated Pose Estimation x y Arthur Gretton

7 Natural Language Parsing: x = the sentence "The screen was a sea of red", y = its parse tree.

8 Bilingual Word Alignment: x = the sentence pair "What is the anticipated cost of collecting fees under the new proposal?" / "En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?"; y = the word-to-word alignment between the English sentence and the tokenized French sentence "En vertu de les nouvelles propositions, quel est le coût prévu de perception de les droits?"

9 Protein Structure and Disulfide Bridges AVITGACERDLQCG KGTCCAVSLWIKSV RVCTPVGTSGEDCH PASHKIPFSGQRMH HTCPCAPNLACVQT SPKKFKCLSK Protein: IMT

10 Local Prediction: Using only local information. Ignores correlations & constraints!

11 Local Prediction (figure labels: building, tree, shrub, ground)

12 Challenges: Local Prediction Anatomical parts are hard to detect Histogram of Gradients (HoG) SVM classifier weights:

13 Local Prediction y x θ left upper arm

15 Challenges: Local Prediction Anatomical parts are hard to detect y x θ left lower arm

16 Structured Prediction: Use local information; exploit correlations; respect constraints.

17 Structured Prediction (figure labels: building, tree, shrub, ground)

18 Structured Prediction

19 Outline. Structured prediction models: sequences, trees (CRFs); parse trees (CFGs); associative Markov networks (special MRFs); matchings. Structured large margin estimation: margins and structure; min-max formulation; linear programming inference; certificate formulation.

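20 Structured Models: prediction $h(x) = \arg\max_{y \in \mathcal{Y}(x)} \mathrm{score}(x, y)$, where $\mathcal{Y}(x)$ is the space of feasible outputs. Mild assumption: the scoring function is a linear combination of features, $\mathrm{score}(x, y) = w^\top f(x, y)$.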

21 Chain Markov Net (aka CRF*): x is the sequence of character images; each label in y ranges over a-z. *Lafferty et al.

22 Chain Markov Net (aka CRF*): x is the sequence of character images; each label in y ranges over a-z. *Lafferty et al.
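
Because a chain model scores a labeling as a sum of per-position and adjacent-pair terms, the highest-scoring labeling can be found by dynamic programming. The following is a minimal sketch of Viterbi decoding, not code from the talk; the score tables `node` and `edge` and the toy numbers in the usage note are assumptions made for illustration.

```python
def viterbi(node, edge):
    """Return (best_score, best_labels) for a chain model.

    node[i][a]: score of label a at position i (assumed given).
    edge[a][b]: score of label a immediately followed by label b (assumed given).
    """
    n, k = len(node), len(node[0])
    best = [list(node[0])]   # best[i][a]: best score of a prefix ending in label a
    back = []                # back[i-1][b]: best previous label when position i has label b
    for i in range(1, n):
        row, ptr = [], []
        for b in range(k):
            a_best = max(range(k), key=lambda a: best[-1][a] + edge[a][b])
            row.append(best[-1][a_best] + edge[a_best][b] + node[i][b])
            ptr.append(a_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the best final label.
    last = max(range(k), key=lambda a: best[-1][a])
    labels = [last]
    for ptr in reversed(back):
        labels.append(ptr[labels[-1]])
    labels.reverse()
    return best[-1][last], labels
```

For example, with `node = [[1, 0], [0, 1], [1, 0]]` and `edge = [[2, 0], [0, 2]]` (an edge score that rewards repeating a label), the call `viterbi(node, edge)` prefers the constant labeling `[0, 0, 0]` over the locally best `[0, 1, 0]`.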

23 Associative Markov Nets: Point features: spin-images, point height. Edge features: length of edge, edge orientation. Node potentials $\phi_j(y_j)$ and edge potentials $\phi_{jk}(y_j, y_k)$, with an associative restriction on the edge potentials.

24 CFG Parsing: features count rule occurrences, e.g. #(NP → DT NN), #(PP → IN NP), #(NN → sea).

25 Bilingual Word Alignment: candidate edges (j, k) link word j of "What is the anticipated cost of collecting fees under the new proposal?" to word k of "En vertu de les nouvelles propositions, quel est le coût prévu de perception de les droits?". Edge features: position, orthography, association.

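26 Structured Models: prediction $h(x) = \arg\max_{y \in \mathcal{Y}(x)} \mathrm{score}(x, y)$, where $\mathcal{Y}(x)$ is the space of feasible outputs. Mild assumptions: the scoring function is a linear combination of features, $\mathrm{score}(x, y) = w^\top f(x, y)$, and it decomposes into a sum of part scores, $w^\top f(x, y) = \sum_p w^\top f(x_p, y_p)$.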

27 Supervised Structured Prediction: Data → Learning (estimate w) → Model → Prediction. Ways to estimate w: local (ignores structure), margin, likelihood (can be intractable). Prediction example: weighted matching; generally: combinatorial optimization.

28 Local Estimation: Treat edges as independent decisions. Estimate w locally, use globally. E.g., naïve Bayes, SVM, logistic regression. Simple and cheap. Not well-calibrated for inference. Ignores correlations & constraints.

29 Conditional Likelihood Estimation: Estimate w jointly. Denominator is #P-complete [Valiant 79, Jerrum & Sinclair 93]. Tractable model, intractable learning. Need tractable learning method → margin-based estimation.

30 Outline. Structured prediction models: sequences, trees (CRFs); parse trees (CFGs); associative Markov networks (special MRFs); matchings. Structured large margin estimation: margins and structure; min-max formulation; linear programming inference; certificate formulation.

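31 OCR Example: We want $\arg\max_{\text{word}} w^\top f(x, \text{word})$ = "brace". Equivalently: $w^\top f(x, \text{"brace"})$ must exceed $w^\top f(x, \text{"aaaaa"})$, $w^\top f(x, \text{"aaaab"})$, ..., $w^\top f(x, \text{"zzzzz"})$: a lot of constraints!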

32 Parsing Example: We want the argmax over parse trees of "It was red" to be the correct tree (nodes S A B C D). Equivalently: the correct tree must outscore every other tree for the sentence (e.g. S A B D F, ..., S E F G H): a lot of constraints!

33 Alignment Example: We want the argmax over alignments of "What is the" / "Quel est le" to be the correct alignment. Equivalently: the correct alignment must outscore every other alignment of the pair: a lot of constraints!

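34 Structured Loss (number of mistakes relative to the truth): OCR candidates "bcare", "brore", "broce" vs. the truth "brace"; parse candidates for "It was red" (S A B C D, S A E C D, S B D A C, S B E A C); alignment candidates for "What is the" / "Quel est le".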
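
For the OCR candidates on this slide, the structured (Hamming) loss is the number of letters that differ from the true word. A small sketch, assuming equal-length strings; the function name is illustrative and not from the talk.

```python
def hamming_loss(truth, guess):
    """Number of positions where guess disagrees with truth."""
    if len(truth) != len(guess):
        raise ValueError("sequences must have equal length")
    return sum(t != g for t, g in zip(truth, guess))

# Candidates from the slide, scored against the true word "brace".
candidates = ["bcare", "brore", "broce", "brace"]
losses = {c: hamming_loss("brace", c) for c in candidates}
# losses == {"bcare": 2, "brore": 2, "broce": 1, "brace": 0}
```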

35 Large margin estimation: Given training examples $(x_i, y_i)$, we want $\arg\max_y w^\top f(x_i, y) = y_i$. Maximize margin. Mistake-weighted margin: the required margin scales with the # of mistakes in y. *Collins 02, Altun et al 03, Taskar 03

36 Large margin estimation: Eliminate the margin variable γ. Add slacks for the inseparable case (hinge loss).
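
For reference, the margin-scaled problem with slacks is usually written as below in the large-margin literature cited on these slides. This is a hedged reconstruction rather than a copy of the slide's formulas: $f$ is the joint feature map, $\ell_i(y)$ the number of mistakes in $y$ relative to $y_i$, and $C$ a slack penalty, all of which are notation assumed here.

```latex
\begin{aligned}
\min_{w,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i \\
\text{s.t.}\quad & w^\top f(x_i, y_i) \;\ge\; w^\top f(x_i, y) + \ell_i(y) - \xi_i
  \qquad \forall i,\ \forall y \in \mathcal{Y}(x_i).
\end{aligned}
```

At the optimum each slack equals the structured hinge loss, $\xi_i = \max_{y}\,[\,w^\top f(x_i, y) + \ell_i(y)\,] - w^\top f(x_i, y_i)$, which is nonnegative because $y = y_i$ contributes zero.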

37 Large margin estimation: Brute force enumeration. Min-max formulation: plug-in linear program for inference.

38 Min-max formulation: Structured loss (Hamming). Replace inference by the inference LP. Key step: discrete optimization → continuous optimization.

39 Alternatives: Perceptron. Simple iterative method. Unstable for structured output: fewer instances, big updates. May not converge if non-separable. Noisy. Voted / averaged perceptron [Freund & Schapire 99, Collins 02]: regularize / reduce variance by aggregating over iterations.
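
A minimal sketch of the averaged structured perceptron may make the update concrete. Everything here is an illustrative assumption: the `features` map (emission and transition counts), the brute-force `predict`, and the toy data in the usage note are not from the talk, and enumeration is only feasible for tiny inputs.

```python
from itertools import product

def features(x, y):
    """Hypothetical joint feature map: counts of (observation, label) pairs
    and of adjacent (label, label) pairs."""
    f = {}
    for xi, yi in zip(x, y):
        f[("emit", xi, yi)] = f.get(("emit", xi, yi), 0) + 1
    for a, b in zip(y, y[1:]):
        f[("trans", a, b)] = f.get(("trans", a, b), 0) + 1
    return f

def score(w, f):
    """Linear score w . f for sparse dict vectors."""
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def predict(w, x, labels):
    """Brute-force argmax over all labelings (exponential; illustration only)."""
    return max(product(labels, repeat=len(x)),
               key=lambda y: score(w, features(x, y)))

def averaged_perceptron(data, labels, epochs=5):
    """Return the average of the weight vectors seen after every example."""
    w, total, steps = {}, {}, 0
    for _ in range(epochs):
        for x, y in data:
            y_hat = predict(w, x, labels)
            if y_hat != tuple(y):
                # Move toward the true output's features, away from the prediction's.
                for k, v in features(x, y).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in features(x, y_hat).items():
                    w[k] = w.get(k, 0.0) - v
            for k, v in w.items():
                total[k] = total.get(k, 0.0) + v
            steps += 1
    return {k: v / steps for k, v in total.items()}
```

As a usage example, training on `[(("a", "b"), ("X", "Y")), (("b", "a"), ("Y", "X"))]` with `labels = ("X", "Y")` yields weights that reproduce both training labelings. Averaging is the variance-reduction step the slide refers to; a practical implementation would replace the enumeration in `predict` with a dynamic program.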

40 Alternatives: Constraint Generation. Add most violated constraint [Collins 02; Altun et al, 03]. Handles several more general loss functions. Need to re-solve QP many times. Theorem: only polynomial # of constraints needed to achieve ε-error [Tsochantaridis et al, 04]. Worst case # of constraints larger than factored.
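
Constraint generation repeatedly asks which output most violates the current margin constraints, that is, which y maximizes score plus loss. The sketch below finds it by enumeration; `score_fn`, `loss_fn`, and the label alphabet are assumed inputs chosen for illustration, and a practical system would use loss-augmented inference instead of enumeration.

```python
from itertools import product

def most_violated(score_fn, loss_fn, y_true, labels):
    """Return (y, violation) maximizing score(y) + loss(y_true, y) - score(y_true).

    score_fn and loss_fn are assumed callables. Enumeration is exponential
    in len(y_true), so this is for illustration only.
    """
    base = score_fn(y_true)
    best_y, best_v = None, float("-inf")
    for y in product(labels, repeat=len(y_true)):
        v = score_fn(y) + loss_fn(y_true, y) - base
        if v > best_v:
            best_y, best_v = y, v
    return best_y, best_v
```

A positive violation means the constraint for that y is not yet satisfied and would be added to the working set before re-solving the QP.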

41 Outline. Structured prediction models: sequences (CRFs); trees (CFGs); associative Markov networks (special MRFs); matchings. Structured large margin estimation: margins and structure; min-max formulation; linear programming inference; certificate formulation.

42 Matching Inference LP: Need Hamming-like loss. One variable z per candidate edge (j, k) between "What is the anticipated cost of collecting fees under the new proposal?" and "En vertu de les nouvelles propositions, quel est le coût prévu de perception de les droits?", subject to degree constraints. Has integral solutions z (A is totally unimodular) [Nemhauser+Wolsey 88].
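
A standard way to write the bipartite matching inference LP is sketched below. The edge score $s_{jk}$ (for instance $w^\top f(x_{jk})$) and the exact form of the degree constraints are assumptions here, since the slide's formula did not survive extraction.

```latex
\begin{aligned}
\max_{z}\quad & \sum_{j,k} s_{jk}\, z_{jk} \\
\text{s.t.}\quad & \sum_{j} z_{jk} \le 1 \quad \forall k, \qquad
                   \sum_{k} z_{jk} \le 1 \quad \forall j, \qquad
                   0 \le z_{jk} \le 1 .
\end{aligned}
```

The constraint matrix is the incidence matrix of a bipartite graph, which is totally unimodular, so every vertex of the feasible region is integral and the LP optimum is attained at a valid matching.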

43 y → z Map for Markov Nets (figure: a labeling y is encoded as 0/1 indicator variables z, one per position and letter a-z, plus one per adjacent letter pair)

44 Markov Net Inference LP: normalization and agreement constraints. Has integral solutions z for chains, (hyper)trees. Can be fractional for untriangulated networks [Chekuri+al, Wainwright+al 02].

45 Associative MN Inference LP: associative restriction. For K=2, solutions are always integral (optimal). For K>2, within factor of 2 of optimal (results for larger cliques). [Greig+al 89, Boykov+al 99, Kolmogorov & Zabih 02, Taskar+al 04]

46 CFG Chart: CNF tree = set of two types of parts: constituents (A, s, e) and CF-rules (A → B C, s, m, e).

47 CFG Inference LP: root, inside, and outside constraints. Has integral solutions z.

48 LP Duality: Linear programming duality. Variables ↔ constraints; constraints ↔ variables. Optimal values are the same whenever both problems are feasible.
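
As a reminder of the correspondence, one standard primal/dual pair is shown below. The symbols $A$, $b$, $c$, $z$, $\lambda$ are generic notation assumed for this sketch, not the slide's own.

```latex
\begin{aligned}
\text{(primal)}\quad & \max_{z}\; c^\top z
  && \text{s.t.}\quad A z \le b,\; z \ge 0 \\
\text{(dual)}\quad & \min_{\lambda}\; b^\top \lambda
  && \text{s.t.}\quad A^\top \lambda \ge c,\; \lambda \ge 0
\end{aligned}
```

Each primal constraint (row of $A$) gets a dual variable $\lambda$, and each primal variable (column of $A$) gets a dual constraint. Replacing an inner maximization by its dual minimization is what lets a min-max problem collapse into a single joint minimization.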

49 Min-max Formulation LP duality

50 Min-max formulation summary: Formulation produces concise QP for low-treewidth Markov networks, associative MNs (K=2), context free grammars, bipartite matchings. Approximate for untriangulated MNs, AMNs with K>2. *Taskar et al 04

51 Unfactored Primal/Dual: By QP duality. Exponentially many constraints/variables.

52 Factored Primal/Dual: By QP duality. Dual inherits structure from problem-specific inference LP. Variables μ correspond to a decomposition of the α variables of the flat case.

53 The Connection (figure: the α weights on whole candidate outputs "bcare", "brore", "broce", "brace" map to variables μ over individual letters and adjacent letter pairs)

54 Duals and Kernels: Kernel trick works in the factored dual. Local functions (log-potentials) can use kernels.

55 3D Mapping: Data provided by Michael Montemerlo & Sebastian Thrun. Sensors: laser range finder, GPS, IMU. Label: ground, building, tree, shrub. Training: 3 thousand points. Testing: 3 million points.

60 Segmentation results: Hand labeled 8K test points.
Model    Accuracy
SVM      68%
V-SVM    73%
M^3N     93%

61 Hypertext Classification: WebKB dataset. Four CS department websites: 3 pages/35 links. Classify each page: faculty, course, student, project, other. Train on three universities, test on the fourth. Test error compared for SVMs, RMNs (loopy belief propagation) and M^3Ns (relaxed LP): % error reduction over SVMs, 38% error reduction over RMNs. *Taskar et al 02

62 Word Alignment Results: Data: [Hansards Canadian Parliament]. Trained on sentences (, edges). Tested on 35 sentences (35, edges). Error: weighted combination of precision/recall. Models compared: GIZA/IBM4 [Och & Ney 03]; +Local learning+matching; +Our approach; +Our approach+QAP. [Taskar+al 05] [Lacoste-Julien+Taskar+al 06] [Vision Apps: T. S. Caetano + al. ICCV 2007]

63 Modeling First Order Effects: Monotonicity, local inversion, local fertility. QAP: NP-complete. Sentences (3 words, k vars): few seconds (Mosek). Learning: use LP relaxation. Testing: using LP, 83.5% sentences, 99.85% edges integral.

64 Outline. Structured prediction models: sequences (CRFs); trees (CFGs); associative Markov networks (special MRFs); matchings. Structured large margin estimation: margins and structure; min-max formulation; linear programming inference; certificate formulation.

65 Certificate formulation: Non-bipartite matchings: $O(n^3)$ combinatorial algorithm; no polynomial-size LP known. Spanning trees: no polynomial-size LP known; simple certificate of optimality (figure: edges ij, kl). Intuition: verifying optimality is easier than optimizing. Compact optimality condition of $y_i$ wrt. the scores.

66 Formulation summary: Brute force enumeration. Min-max formulation: plug-in convex program for inference. Certificate formulation: directly guarantee optimality of $y_i$.

67 Scalable Algorithms: Convex quadratic program; # variables and constraints linear in # parameters, edges. Can solve using off-the-shelf software (Matlab, CPLEX, Mosek, etc.); superlinear convergence. Problem: linear is too large; second-order methods run out of memory (quadratic). Need scalable memory-efficient methods: space/time tradeoff. Structured SMO [Taskar+al 04]; structured exponentiated gradient [Bartlett+al 04, Collins+al 07]. Don't work for matchings, min-cuts.

68 Other approaches: Online methods: online updates with respect to most violated constraints [Crammer+al 05, 06]. Regression based methods: regression from input to transformed output space [Cortes+al 07]. Learning to search: learn classifier to guide local search for structured solution [Daume+al 05]. Many others.

69 Generalization Bounds: "If the past is any indication of the future, he'll have a cruller."

70 Generalization Bounds

71 Several Pointers: Perceptron bound [Collins]: assume separability with margin γ; bound on 0-1 loss. Covering-number bound [Taskar+al 03]: bound on Hamming loss; logarithmic dependence on # variables in each y. Regret bounds [Crammer+al 06]: online-style guarantees for more general loss. PAC-Bayes bound [McAllester 07]: tighter analysis, consistency. Bounds for learning with approximate inference [Kulesza & Pereira, 07].

72 Open Questions for Large-Margin Estimation: Statistical consistency: hinge loss not consistent for non-binary output [see Tewari & Bartlett 05, McAllester 07]. Semi-supervised: Laplacian-regularization [Altun+McAllester 05], co-regularization [Brefeld+al 05]. Latent variables: machine translation [Liang+al 06], CCG parsing to logical form [Zettlemoyer+Collins 07], latent SVM [Ramanan et al 08]. Learning with approximate inference.

73 Learning with LP relaxations: Does constant-factor approximate inference guarantee anything a priori about learning? No [see Kulesza & Pereira 07]. Simple 3-node counterexample: separable with exact inference, not separable with approximate. Question: what other (stronger?) approximate inference guarantees will translate into learning guarantees?

74 Thanks!
