Low-Dimensional Discriminative Reranking. Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park

Size: px

Start display at page:

Download "Low-Dimensional Discriminative Reranking. Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park"

Marjorie Griffin
6 years ago
Views:

1 Low-Dimensional Discriminative Reranking Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park

2 Discriminative Reranking Useful for many NLP tasks Enables us to use arbitrary features Search space is limited

3 Related Work POS Tagging Collins 2002, Huang et al Dependency Parsing Charniak and Johnson 2005, Shen et al Machine Translation Watanabe et al Use joint features defined on input and output pair Increases the number of features Feature engineering becomes essential

4 Outline Introduction Low-Dimensional Reranking Models Generative-Style Model Discriminative-Style Model Softened-Discriminative Model Experiments Conclusion

5 Low-dimensional Reranking 1) Run independent feature extractors x y 2) Use training data to find low-dimensional space X = [x 1,...,x n ] ; y 2 Y=[y 1, y n ] x 1 y 1 x 2 x 1 x 3 A B y 1 y 3 y 2 x 4 y 4

6 Ranking Models 1) Generative Style Model 2) Discriminative Style Model 3) Softened-Discriminative Model

7 Ranking Models 1) Generative Style Model Uses only reference outputs i.e. {x i, y i } i=1...n 2) Discriminative Style Model Uses reference and candidate outputs Ensures the highest scoring output is the reference 3) Softened-Discriminative Model {x i, y i, y ij }i=1... n Difficult to solve the Discriminative-Style Model Relaxed version which is tractable

8 1) Generative-Style Model Uses only input, ref output pairs Aim to find A and B such that {x i, y i }i=1... n A T x Projection is nothing but and B T y y 2 x 1 y x 2 1 x 1 x 3 A B y 1 y 3 y 2 x 4 y 4

9 1) Generative-Style Model Uses only input, ref output pairs Aim to find a and b such that a T X Projection is nothing but and {x i, y i }i=1... n b T Y (a,b)=argmax cos( X T a, Y T b) argmax a T X Y T b s.t. X T a = Y T b =1 Similarity measured using Cosine Angle Can be solved exactly using Gen. Eigenvalue problem Use the first k eigenvectors to form A and B 1 2 3

10 Discriminative-Style Models Intuition Use ref. ( y 1 ) and candidate ( ŷ 1 ) outputs Aim to find A and B s.t. ref. output is highest scoring y i =argmax ŷ ij cos( A T x i, B T ŷ ij ) x 1 y 1 ŷ 2 ŷ 1 ŷ 3 Similarity measured using Cosine Angle A B y 1 ŷ 2 x 1 ŷ 3 ŷ 1

11 Discriminative-Style Models Use ref. ( y 1 ) and candidate ( ŷ 1 ) outputs Aim to find A and B s.t. ref. output is highest scoring y i =argmax ŷ ij cos( A T x i, B T ŷ ij ) x 1 y 1 ŷ 2 ŷ 1 ŷ 3 A B y 1 ŷ 2 x 1 ŷ 3 ŷ 1

12 Discriminative-Style Models Use ref. ( y 1 ) and candidate ( ŷ 1 ) outputs Aim to find A and B s.t. ref. output is highest scoring y i =argmax ŷ ij cos( A T x i, B T ŷ ij ) 1) Observe that the score is: a T x i T yij b 2)Add constraints to penalize incorrect ones m ij =a T x i y i T b a T x i ŷ ij b m ij 1 ξ i L ij

13 2) Discriminative-Style Model Enforce the individual constraints argmax a,b,ξ 0 1 λ λ at X Y T b i ξ i a T X X T a = 1 b T Y Y T b = 1 m ij 1 ξ i L ij

14 2) Discriminative-Style Model Enforce the individual constraints argmax a,b,ξ 0 1 λ λ at X Y T b i ξ i a T X X T a = 1 b T Y Y T b = 1 m ij 1 ξ i L ij

15 2) Discriminative-Style Model Enforce the individual constraints argmax a,b,ξ 0 1 λ λ at X Y T b i ξ i a T X X T a = 1 b T Y Y T b = 1 m ij 1 ξ i L ij

16 2) Discriminative-Style Model Enforce the individual constraints argmax a,b,ξ 0 1 λ λ at X Y T b i ξ i a T X X T a = 1 b T Y Y T b = 1 m ij 1 ξ i L ij

17 2) Discriminative-Style Model Enforce the individual constraints argmax a,b,ξ 0 1 λ λ at X Y T b i ξ i a T X X T a = 1 b T Y Y T b = 1 m ij 1 ξ i L ij Harder to solve exactly

18 3) Softened-Discriminative Model Discriminative Model argmax a,b, ξ 0 1 λ λ at X Y T b i ξ i s.t. length constraints m ij 1 ξ i L ij Softened-Discriminative Model argmax a,b (1 λ)a T X Y T b + λ ij L ij m ij s.t. length constraints Similar to Generative Model it can be solved exactly We use it to solve Discriminative-Style Model

19 Optimizing Discriminative-Style Model α ij = L ij Repeat // Initialization [A (i),b (i) ] = Softened-Disc(X,Y, ) m ij If End if = a T x i y i T b a T x i ŷ ij b ψ ij = (1 m ij )L ij ξ i = min {0, ψ ij s.t. ψ ij >0 } ξ i > 0 d ij α ij = m ij (1 ξ i ) L ij α ij γ d ij Until convergence α ij // Get the current soln. // Compute margins // Potential Slack // Compute Slack // Update the Lagrangian variables // Slack doesn't change

20 Outline Introduction Low-Dimensional Reranking Models Generative-Style Model Discriminative-Style Model Softened-Discriminative Model Experiments Conclusion

21 Reranking for POS Tagging HMM Trigram tagger is used to generate 10-best list Use multi-fold cross validation Training Use one of our models to find mappings A and B

22 Reranking for POS Tagging HMM Trigram tagger is used to generate 10-best list Use multi-fold cross validation Training Use one of our models to find mappings A and B Reranking

23 Reranking for POS Tagging HMM Trigram tagger is used to generate 10-best list Use multi-fold cross validation Training Use one of our models to find mappings A and B Reranking For every candidate tag sequence Compute cosine similarity in the sub-space Combine this score with Viterbi decoding score Interpolation parameter is tuned

24 Experimental Results Data Sets CoNLL shared task on Dependency Parsing Use the fine-grained POS tags Baseline system HMM Trigram Tagger Thede and Harper, 1999 Baseline Rerankers Collins and Koo, 2005 Its regularized version Huang et al. 2007

25 Features For our models: Sentence is represented as bag of suffixes of length 2 to 4 (x) For POS tag-sequences, we use Unigram and Bigram POS tags as features (y) For baseline Rerankers Feature of the form (suffix, Tag 0 ), (suffix,tag 0,Tag -1 )

26 Results English Chinese French Swedish Baseline Collins Regularized Oracle Oracle improvements are less compared to Parsing

27 Results English Chinese French Swedish Baseline Collins Regularized Generative Softened-Disc Discriminative Oracle One of our models performs better

28 Results Performance is not very sensitive to hyperparams Combination with Viterbi Score is crucial English Chinese French Swedish Generative Softened-Disc Discriminative Generative Softened-Disc Discriminative Performance drops by more than 2%

29 Conclusions Family of rerankers based on dim. reduction Max-margin CCA [Szedmak et al. 2007] Max-margin regression [Szedmak et al. 2006,Wang et al. 2007] Supervised Semantic Hasing [Bai et al. 2010,Grangier et al. 2006] Learns a linear classifier in the joint feature space But weights are constrained Perfect when defining the joint features is not clear Scalability Online SVD algorithms [Halko et al. 2009] Adaptive variants of CCA [Yger et al. 2012]

30 Thank You!!

31 Performance with regularization param. (tau) Discriminative Softened-Disc Generative Baseline

32 Performance with weight param. (lambda) Discriminative Softened-Disc Generative Baseline

Machine Learning for Structured Prediction

Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for