Search-Based Structured Prediction

1 Search-Based Structured Prediction. By Harold C. Daumé III (Utah), John Langford (Yahoo), and Daniel Marcu (USC); submitted to Machine Learning, 2007. Presented by Eugene Weinstein, NYU/Courant Institute, October 2nd, 2007.

2 Structured Prediction Intro. Given: labeled training data $(x_1, y_1), \ldots, (x_m, y_m) \in X \times Y$. Task: learn a mapping from inputs $x \in X$ to outputs $y \in Y$. Special cases: binary classification, $Y = \{-1, +1\}$; multiclass classification, $Y = \{1, \ldots, k\}$. Natural language parsing example: $x$ is a sentence and $y$ is its parse tree (figure in the original slides).
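To make the setup concrete, here is a tiny sketch of training pairs for sequence labeling (illustrative data, not from the paper), where each input is a token sequence and each output a tag sequence of the same length:

```python
# Training pairs (x_i, y_i) in X x Y for part-of-speech tagging:
# inputs are token sequences, outputs are equal-length tag sequences.
train = [
    (["the", "man", "bit", "the", "dog"],
     ["DT", "NN", "VBD", "DT", "NN"]),
    (["dogs", "bark"],
     ["NNS", "VBP"]),
]
for x, y in train:
    assert len(x) == len(y)  # one output label per input token
```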

3 Exploiting Structure. Naive approach: treat each possible output in $Y$ as a discrete label and apply multiclass classification. But: enumerating all members of $Y$ is often intractable, and this cannot model closeness of examples (changing one node of a tree vs. changing the entire tree). Approach: exploit structure and dependencies within the output space, representing closeness of outputs with a loss function. (Figure: pairs of trees illustrating small loss for a one-node change and big loss for an entirely different tree.)
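Both objections can be made quantitative in a few lines; the tag-set size and sentence length below are illustrative assumptions:

```python
# 1) Enumeration is intractable: with k labels and length-L outputs,
#    |Y| = k**L grows exponentially.
k, L = 45, 20                    # e.g. a POS tag set, a 20-word sentence
print(f"|Y| = {k ** L:.2e}")     # on the order of 10**33 candidate outputs

# 2) A structured loss expresses closeness: Hamming loss counts how
#    many positions differ, so changing one node costs far less than
#    changing the entire output.
def hamming_loss(y_true, y_pred):
    return sum(a != b for a, b in zip(y_true, y_pred))

y      = ["DT", "NN", "VBD", "DT", "NN"]
almost = ["DT", "NN", "VBD", "DT", "NNS"]   # one position changed: loss 1
far    = ["VB", "JJ", "IN",  "CC", "RB"]    # everything changed: loss 5
print(hamming_loss(y, almost), hamming_loss(y, far))
```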

4 SP Overview. Discriminative structured prediction papers typically extend multiclass classification or regression techniques. Most classification schemes use SVM-like max-margin linear classifiers incorporating loss functions [Taskar, Guestrin, Koller 03], [Tsochantaridis, Hofmann, Joachims, Altun 04], [Sha, Saul 07]. Regression formulation of SP: [Cortes, Mohri, Weston 06]. Searn is a meta-algorithm; the claim is that given a multiclass classifier achieving good generalization, Searn achieves the same for structured prediction.

5 Search-based SP [Daumé 06] [Daumé, Langford, Marcu 07]. Searn: view structured prediction as a search problem. SP: a distribution $D$ over input/cost pairs $(x, c)$ with $c \in \mathbb{R}^{|Y|}$; e.g., $x_i$ is the input and $c_y$ is the loss of predicting $y$ when the true label is $y_i$. Define the loss of a cost-sensitive classifier $h : X \to Y$ as $L(D, h) = E_{(x,c) \sim D}\left[ c_{h(x)} \right]$. View outputs as vectors $y = (y^{(1)}, \ldots, y^{(l)})$, but the classification problems handled are not limited to sequences. A classifier defines a path through the space of input/output pairs, and the training process iteratively refines the classifier.
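A minimal sketch of estimating the cost-sensitive loss $L(D, h)$ from a finite sample; the toy cost vectors are assumptions for illustration:

```python
# Empirical estimate of L(D, h) = E_{(x,c)~D}[ c_{h(x)} ]: average,
# over sampled examples, the cost of the label the classifier picks.
def empirical_loss(sample, h):
    """sample: list of (x, c) pairs, c a dict mapping label -> cost."""
    return sum(c[h(x)] for x, c in sample) / len(sample)

# Toy sample with three labels; cost 0 marks the true label.
sample = [
    ("x1", {"a": 0.0, "b": 1.0, "c": 2.0}),
    ("x2", {"a": 1.0, "b": 0.0, "c": 1.0}),
]
h = lambda x: "a"                  # a constant (and imperfect) classifier
print(empirical_loss(sample, h))   # (0.0 + 1.0) / 2 = 0.5
```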

6 Searn Specifics. We need to provide three things, sketched below: a cost-sensitive multiclass learning algorithm, an initial classifier, and a loss function. The initial classifier should have low training error, but need not generalize well; it could be the best path from any standard search algorithm. Each Searn iteration finds a classifier that is not as good on the training set, but generalizes a little better.
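The three ingredients written out as Python signatures; the type names are illustrative assumptions, not the paper's notation:

```python
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[Sequence, Tuple]            # (input x, partial output so far)
Example = Tuple[State, Dict[str, float]]  # a state plus per-action costs

# 1) Cost-sensitive multiclass learner: examples -> next-action classifier.
Learner = Callable[[List[Example]], Callable[[State], str]]

# 2) Initial classifier: may consult the true output at training time,
#    e.g. the best path returned by any standard search algorithm.
InitialPolicy = Callable[[State, Sequence], str]

# 3) Loss function over complete structured outputs.
Loss = Callable[[Sequence, Sequence], float]
```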

7 Searn Training. Search state space: (input, partial output), i.e. $s = (x, y^{(1)}, \ldots, y^{(l)})$. Initial classifier: pick the next label that minimizes the cost, assuming that all future decisions are also optimal: $h_0(s, c) = \arg\min_{y^{(l+1)}} \min_{y^{(l+2)}, \ldots, y^{(L)}} c_{(y^{(1)}, \ldots, y^{(L)})}$. Iterative step: use the current classifier $h$ to construct a set of examples to train the next classifier, then interpolate. For each state, try every possible next output; the cost assigned to each output tried is the loss difference $\ell_h(c, s, a) = E_{y \sim (s, a, h)}\, c_y - \min_{a'} E_{y \sim (s, a', h)}\, c_y$.
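Under Hamming loss the initial classifier has a closed form: if all future decisions are optimal, the cost-minimizing next label is simply the true label at the current position. A sketch under that assumption:

```python
# Optimal policy h_0 under Hamming loss: the completion that minimizes
# the remaining cost copies the true output, so the best next action
# is the true label at the current position.
def h0_hamming(state, y_true):
    x, partial = state
    return y_true[len(partial)]

y_true = ["DT", "NN", "VBD"]
state = (["the", "man", "bit"], ("DT",))   # one label already emitted
print(h0_hamming(state, y_true))           # -> "NN"
```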

8-12 Searn Training Illustration (figure sequence). A sequence-labeling example over positions $i = 1, \ldots, 5$ shows: the path predicted by the current classifier $h$, the current state $s$, each potential next state $a$, and the alternative paths $(s, a, h)$ obtained by taking $a$ and letting $h$ finish the sequence. Each candidate action is scored with $\ell_h(c, s, a) = E_{y \sim (s, a, h)}\, c_y - \min_{a'} E_{y \sim (s, a', h)}\, c_y$; in the illustrated example the candidates receive costs $\ell_h = 2$, $5$, $1$, and $0$.
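The per-action costs in the illustration can be computed by rollouts: take the candidate action, let the current classifier finish the output, and compare losses. A sketch assuming Hamming loss and a deterministic classifier (both simplifications):

```python
# Roll the current classifier h forward from a partial output until
# the output is complete.
def rollout(x, partial, h):
    out = list(partial)
    while len(out) < len(x):
        out.append(h(x, tuple(out)))
    return out

# l_h(c, s, a): loss of (take a, then follow h) minus the best such loss.
def action_costs(x, partial, h, actions, y_true, loss):
    raw = {a: loss(y_true, rollout(x, partial + (a,), h)) for a in actions}
    best = min(raw.values())
    return {a: raw[a] - best for a in raw}

h = lambda x, partial: "NN"          # toy current classifier
x = ["the", "man", "bit", "the", "dog"]
y_true = ["DT", "NN", "VBD", "DT", "NN"]
hamming = lambda yt, yp: sum(a != b for a, b in zip(yt, yp))
print(action_costs(x, ("DT",), h, ["NN", "VBD"], y_true, hamming))
# -> {'NN': 0, 'VBD': 1}
```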

13 Searn Meta-Algorithm. Input: $(x_1, y_1), \ldots, (x_m, y_m)$, initial classifier $h_0$, cost-sensitive learner $A$. A state consists of the input and a partial output; the losses computed with the current classifier build up the training examples for the next iteration.

    h <- h_0
    while h has a significant dependence on h_0:
        initialize the set of cost-sensitive examples S <- {}
        for i = 1, ..., m:
            compute the prediction (y^(1), ..., y^(L)) <- h(x_i)
            for l = 1, ..., L:
                s_l <- (x_i, y^(1), ..., y^(l))
                for each next output a after s_l:
                    c_a <- l_h(c, s_l, a)
                compute features and add the example: S <- S + {(f(s_l), c)}
        learn and interpolate: h' <- A(S); h <- beta*h' + (1 - beta)*h
    return h with h_0 removed
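Putting the pieces together, a compact sketch of the whole loop under simplifying assumptions (deterministic rollouts, a loss over complete outputs, and a stochastic realization of the interpolation step); the function names are illustrative, not the paper's reference implementation:

```python
import random

def searn(train, h0, A, loss, actions, beta=0.1, iterations=5):
    """train: list of (x, y_true) pairs with sequence outputs;
    h0(x, partial, y_true) -> action is the initial (optimal) policy;
    A(examples) -> classifier mapping a state (x, partial) to an action."""
    current = h0                                      # start from h_0
    for _ in range(iterations):
        S = []                                        # cost-sensitive examples
        for x, y in train:
            partial = ()
            for _pos in range(len(y)):
                # Cost of each candidate action: take it, let the current
                # policy finish the output, and take the loss difference.
                raw = {}
                for a in actions:
                    out = list(partial) + [a]
                    while len(out) < len(y):
                        out.append(current(x, tuple(out), y))
                    raw[a] = loss(y, out)
                best = min(raw.values())
                S.append(((x, partial), {a: raw[a] - best for a in raw}))
                partial += (current(x, partial, y),)  # follow current policy
        h_new, prev = A(S), current
        # Interpolation h <- beta*h_new + (1-beta)*h, realized
        # stochastically: each decision uses h_new with probability beta.
        current = (lambda hn, hp: lambda x, p, y:
                   hn((x, p)) if random.random() < beta else hp(x, p, y))(h_new, prev)
    return current
```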

14 Algorithm Analysis. $h_i$ is the classifier trained up to the $i$th iteration, and $\ell_{h_i}(h_i)$ is its loss on that iteration's training examples; $T$ is the maximum length of any output sequence. Theorem: if $c_{\max} = E_{(x,c) \sim D} \max_y c_y$ and $\ell_{avg} = \frac{1}{I} \sum_{i=1}^{I} \ell_{h_i}(h_i)$ is the average loss over $I$ iterations, then with $\beta = 1/T^3$ and $I = 2T^3 \ln T$ iterations the loss of the final classifier is bounded as $L(D, h_{last}) \le L(D, h_0) + 2 T \ell_{avg} \ln T + (1 + \ln T)\, c_{\max} / T$. The proof analyzes the mixture of old and new classifiers. In practice, $\beta$ can be larger (more aggressive learning).
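For a feel of the schedule the theorem prescribes, a quick numeric check (illustrative values):

```python
import math

T = 10                          # maximum output length
beta = 1 / T**3                 # interpolation weight: 0.001
iters = 2 * T**3 * math.log(T)  # about 4605 iterations
slack = (1 + math.log(T)) / T   # coefficient on c_max: about 0.33
print(beta, round(iters), round(slack, 3))
```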

15 Proof. Lemma 1: for the classifier $h_{new} = \beta h' + (1 - \beta) h$ obtained by interpolating $h'$ into $h$, if $c_{\max} = E_{(x,c) \sim D} \max_y c_y$, then $L(D, h_{new}) \le L(D, h) + T \beta\, \ell_h^{CS}(h') + \tfrac{1}{2} \beta^2 T^2 c_{\max}$. Proof: let $c'$ be the number of the $T$ decisions on which the mixture calls $h'$, and consider three cases: $h'$ is never called ($c' = 0$), called exactly once ($c' = 1$), or called more than once ($c' \ge 2$). Then the loss of $h_{new}$ is bounded as
$L(D, h_{new}) = \Pr(c' = 0)\, L(D, h_{new} \mid c' = 0) + \Pr(c' = 1)\, L(D, h_{new} \mid c' = 1) + \Pr(c' \ge 2)\, L(D, h_{new} \mid c' \ge 2)$
$\le (1-\beta)^T L(D, h) + T \beta (1-\beta)^{T-1} \left[ L(D, h) + \ell_h^{CS}(h') \right] + \left[ 1 - (1-\beta)^T - T \beta (1-\beta)^{T-1} \right] c_{\max}$.

16 Proof Cont'd. Continuing from the case decomposition and simplifying:
$L(D, h_{new}) \le (1-\beta)^T L(D, h) + T\beta(1-\beta)^{T-1}\left[L(D, h) + \ell_h^{CS}(h')\right] + \left[1 - (1-\beta)^T - T\beta(1-\beta)^{T-1}\right] c_{\max}$
$\le L(D, h) + T\beta\, \ell_h^{CS}(h') + \left[1 - (1-\beta)^T - T\beta(1-\beta)^{T-1}\right] \left(c_{\max} - L(D, h)\right)$  [binomial expansion; $(1-\beta)^{T-1} \le 1$]
$\le L(D, h) + T\beta\, \ell_h^{CS}(h') + \left[1 - (1-\beta)^T - T\beta(1-\beta)^{T-1}\right] c_{\max}$
$\le L(D, h) + T\beta\, \ell_h^{CS}(h') + \tfrac{1}{2} T^2 \beta^2 c_{\max}$  [binomial expansion; keep the leading $\binom{T}{2}\beta^2$ term, valid when $T\beta < 1/2$],
which is the bound claimed in Lemma 1.

17 Proof Cont'd. Lemma 2: after $C/\beta$ iterations of Searn, the loss of the final classifier learned is bounded as $L(D, h_{last}) \le L(D, h_0) + C T \ell_{avg} + c_{\max} \left( \tfrac{1}{2} C T^2 \beta + T \exp(-C) \right)$. Proof: invoking Lemma 1 repeatedly over the $C/\beta$ iterations gives $L(D, h) \le L(D, h_0) + C T \ell_{avg} + \tfrac{1}{2} C T^2 \beta\, c_{\max}$. Removing the initial (optimal) classifier at the end might incur a loss of up to $c_{\max} T$; the probability that $h_0$ is still called after $C/\beta$ iterations is at most $(1 - \beta)^{C/\beta} \le \exp(-C)$, which contributes the final $c_{\max} T \exp(-C)$ term.
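Slide 14's theorem follows from Lemma 2 by substituting the schedule $\beta = 1/T^3$ and $C = 2 \ln T$ (so that $C/\beta = 2T^3 \ln T$ iterations); a worked check of the arithmetic:

```latex
L(D, h_{last})
  \le L(D, h_0) + C T \ell_{avg}
      + c_{\max}\left(\tfrac{1}{2} C T^2 \beta + T e^{-C}\right)
   =  L(D, h_0) + 2 T \ell_{avg} \ln T
      + c_{\max}\left(\tfrac{1}{2} \cdot 2 \ln T \cdot \tfrac{T^2}{T^3}
                      + T e^{-2 \ln T}\right)
   =  L(D, h_0) + 2 T \ell_{avg} \ln T
      + c_{\max}\left(\tfrac{\ln T}{T} + \tfrac{1}{T}\right)
   =  L(D, h_0) + 2 T \ell_{avg} \ln T + (1 + \ln T)\,\tfrac{c_{\max}}{T}.
```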

18 Experiments. Handwriting recognition [Kassel 95]. Named entity recognition (Spanish newswire example): El presidente de la [Junta de Extremadura]_ORG, [Juan Carlos Rodríguez Ibarra]_PER, recibirá en la sede de la [Presidencia del Gobierno]_ORG extremeño a familiares de varios de los condenados por el proceso [Lasa-Zabala]_MISC, entre ellos a [Lourdes Díez Urraca]_PER, esposa del ex gobernador civil de [Guipúzcoa]_LOC [Julen Elgorriaga]_PER; y a [Antonio Rodríguez Galindo]_PER, hermano del general [Enrique Rodríguez Galindo]_PER. Syntactic chunking and part-of-speech (POS) tagging: [Great American]_NP [said]_VP [it]_NP [increased]_VP [its loan-loss reserves]_NP [by]_PP [$ 93 million]_NP [after]_PP [reviewing]_VP [its loan portfolio]_NP, [raising]_VP [its total loan and real estate reserves]_NP [to]_PP [$ 217 million]_NP. The same sentence in per-token (word, POS tag, chunk tag) format begins:

    Great     NNP   B-NP
    American  NNP   I-NP
    said      VBD   B-VP
    it        PRP   B-NP
    increased VBD   B-VP
    its       PRP$  B-NP
    loan-loss NN    I-NP
    reserves  NNS   I-NP
    by        IN    B-PP
    $         $     B-NP
    93        CD    I-NP
    million   CD    I-NP
    after     IN    B-PP
    reviewing VBG   B-VP
    its       PRP$  B-NP
    loan      NN    I-NP
    portfolio NN    I-NP
    .         .     O
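The bracketed chunk annotation and the per-token BIO encoding shown above are two views of the same structure; a small sketch of converting spans to BIO tags (illustrative helper, not from the paper):

```python
# Convert (tokens, chunk-label) spans to per-token BIO tags:
# the first token of a chunk gets "B-X", the rest "I-X".
def spans_to_bio(spans):
    """spans: list of (tokens, label) pairs; label None means outside."""
    tagged = []
    for tokens, label in spans:
        for j, tok in enumerate(tokens):
            if label is None:
                tagged.append((tok, "O"))
            else:
                tagged.append((tok, ("B-" if j == 0 else "I-") + label))
    return tagged

spans = [(["Great", "American"], "NP"), (["said"], "VP"), (["it"], "NP")]
for tok, tag in spans_to_bio(spans):
    print(tok, tag)   # Great B-NP / American I-NP / said B-VP / it B-NP
```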

19 Experiments. Results table (the numeric scores were not preserved in this transcription). Columns: Handwriting (Small, Large), NER (Small, Large), Chunk, and C+T (joint chunking + tagging). Rows, in three groups:
CLASSIFICATION: Perceptron, Log Reg, SVM-Lin, SVM-Quad
STRUCTURED: Str. Perc., CRF, SVM-struct, M³N-Lin, M³N-Quad
SEARN: Perceptron, Log Reg, SVM-Lin, SVM-Quad

20 Experiments. New vine-growth model for sentence summarization. DUC 2005 data set: 50 sets of 25 documents each. Evaluation: Rouge (n-gram overlap) vs. human summaries. Table 2 (summarization results; values are Rouge 2 scores, higher is better) compares Oracle and Searn in vine and extraction variants (with D05 and D03 training data), BayeSum, the baseline, and the best system at the 100-word setting; the numeric scores were not preserved in this transcription.
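Rouge-2 scores summaries by bigram overlap with human references; a minimal recall-style sketch, simplified relative to the official scorer:

```python
# Simplified Rouge-2 recall: fraction of reference bigrams that also
# appear in the candidate summary (the official scorer adds stemming,
# multiple references, and other refinements).
def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge2_recall(candidate, reference):
    cand = set(bigrams(candidate))
    ref = bigrams(reference)
    return sum(b in cand for b in ref) / len(ref)

ref = "the senate passed the budget bill".split()
cand = "senate passed the budget yesterday".split()
print(round(rouge2_recall(cand, ref), 2))   # 3 of 5 ref bigrams -> 0.6
```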

21 Bibliography
Harold C. Daumé III. Practical Structured Learning for Natural Language Processing. Ph.D. thesis, University of Southern California, 2006.
Harold C. Daumé III, John Langford, and Daniel Marcu. Search-Based Structured Prediction. Submitted to Machine Learning, 2007.
Robert Kassel. A Comparison of Approaches to On-line Handwritten Character Recognition. Ph.D. thesis, Massachusetts Institute of Technology, Spoken Language Systems Group, 1995.
Koby Crammer and Yoram Singer. On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines. Journal of Machine Learning Research, 2, 2001.
Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-Margin Markov Networks. Neural Information Processing Systems (NIPS) 16, 2003.
Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. Proceedings of ICML, 2004.
Fei Sha and Lawrence K. Saul. Large Margin Hidden Markov Models for Automatic Speech Recognition. Neural Information Processing Systems (NIPS) 19, 2007.
William W. Cohen and Vitor Carvalho. Stacked Sequential Learning. Proceedings of IJCAI, 2005.
Michael Collins and Brian Roark. Incremental Parsing with the Perceptron Algorithm. Proceedings of ACL, 2004.
Alina Beygelzimer, Varsha Dani, Tom Hayes, John Langford, and Bianca Zadrozny. Error Limiting Reductions Between Classification Tasks. Proceedings of ICML, 2005.
