Multi-Task Structured Prediction for Entity Analysis: Search-Based Learning Algorithms
1 Multi-Task Structured Prediction for Entity Analysis: Search-Based Learning Algorithms Chao Ma, Janardhan Rao Doppa, Prasad Tadepalli, Hamed Shahbazi, Xiaoli Fern Oregon State University, Washington State University Nov. 17th, 2017
2 Entity Analysis in Language Processing Many NLP tasks process mentions of entities: things, people, organizations, etc. Examples include Named Entity Recognition, Coreference Resolution, Entity Linking, Semantic Role Labeling, and Entity Relation Extraction.
3-14 Coreference Resolution Example (mention i = 1: [Columbia], mention i = 2: [Columbia University]): He left [Columbia] in 1983 with a BA degree, ... after graduating from [Columbia University], he worked as a community organizer in Chicago. Coreference: y_i ∈ {1, 2, ..., i} — each mention links to an earlier co-referent mention, or to itself to start a new cluster; here Columbia and Columbia University are co-referent. Left-linking tree formulation: scan the mentions m_1, ..., m_7 left to right and fill in one link at a time, ycoref = (?, ?, ?, ?, ?, ?, ?) → (1, ?, ?, ?, ?, ?, ?) → ... → (1, 1, 2, 4, 4, 6, 7), which induces the coreference clustering {m_1, m_2, m_3}, {m_4, m_5}, {m_6}, {m_7}.
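To make the left-linking formulation concrete, here is a minimal sketch (function name and conventions are ours, not the talk's) that decodes a left-link vector into clusters, treating a self-link as the start of a new cluster:

```python
def leftlinks_to_clusters(links):
    """Convert a 1-indexed left-link vector into coreference clusters.
    links[i-1] == i means mention i links to itself (a new cluster);
    otherwise it joins the cluster of the earlier mention it links to."""
    cluster_of = {}   # mention index -> cluster id
    clusters = []     # list of clusters, each a list of mention indices
    for i, link in enumerate(links, start=1):
        if link == i:                      # self-link: start a new cluster
            cluster_of[i] = len(clusters)
            clusters.append([i])
        else:                              # join an earlier mention's cluster
            cid = cluster_of[link]
            cluster_of[i] = cid
            clusters[cid].append(i)
    return clusters

print(leftlinks_to_clusters((1, 1, 2, 4, 4, 6, 7)))
# -> [[1, 2, 3], [4, 5], [6], [7]]
```

With the example's link vector this reproduces the clustering {m1, m2, m3}, {m4, m5}, {m6}, {m7}.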
15-16 Named Entity Recognition and Entity Linking For the same example — He left [Columbia] in 1983 with a BA degree, ... after graduating from [Columbia University], he worked as a community organizer in Chicago — the three tasks are: Coreference: ycoref = (Columbia ↔ Columbia University, ...). Named Entity Recognition: yner = (ORG, ORG, ...), with y_i ∈ {ORG, PER, GPE, LOC, FAC, VEH, WEA}. Entity Linking: ylink = (en.wikipedia.org/wiki/Columbia_University, ...), with y_i ranging over knowledge-base entries such as Columbia_University.
17 Single-Task Structured Prediction Typical (single-task) structured prediction: from input x, learn a scoring function f(x, y) = w · ϕ(x, y) with f : X × Y → ℝ, where ϕ(x, y) is a feature vector over the input and the structured output. Inference: ŷ = argmax_y f(x, y), which is intractable in most cases. Candidate learning methods: graphical models, Structured Perceptron, Structured SVM (this work). Candidate inference methods: Belief Propagation, Integer Linear Programming (ILP), Beam Search (this work).
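Since the exact argmax is intractable, the talk turns to beam search for inference. A hedged sketch of beam-search decoding for a linear scoring function f(x, y) = w · ϕ(x, y), with an illustrative toy feature function (everything here is an assumption, not the authors' code):

```python
import numpy as np

def beam_search(domains, phi, w, beam_width=4):
    """Fill output variables left to right, keeping only the top-scoring
    partial outputs under f = w . phi(partial) at each step."""
    beam = [()]  # partial assignments, as tuples
    for domain in domains:
        candidates = [p + (v,) for p in beam for v in domain]
        candidates.sort(key=lambda p: float(w @ phi(p)), reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

# toy problem: three binary variables; the score just sums the values,
# so the best full output is all ones
phi = lambda p: np.array([sum(p), len(p)])
w = np.array([1.0, 0.0])
print(beam_search([[0, 1], [0, 1], [0, 1]], phi, w))  # -> (1, 1, 1)
```

Beam width trades accuracy for speed: width 1 is greedy decoding, larger widths explore more of the output space.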
18 Structured SVM Learning with Search-based Inference
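As a rough illustration of learning with search-based inference, the sketch below uses a structured-perceptron-style update (a common stand-in for the SSVM margin update): run inference with the current weights and, on a mistake, move the weights toward the gold output's features. All interfaces here are illustrative assumptions, not the talk's implementation:

```python
import numpy as np

def train_structured(examples, phi, infer, dim, epochs=5, lr=0.1):
    """examples: (x, y_gold) pairs; infer(x, w) -> predicted output
    (e.g. via beam search); phi(x, y) -> feature vector. On a mistake,
    move the weights toward the gold features, away from the prediction."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in examples:
            y_hat = infer(x, w)
            if y_hat != y_gold:
                w += lr * (phi(x, y_gold) - phi(x, y_hat))
    return w

# toy task: binary input, binary output, one-hot features over (x, y)
phi = lambda x, y: np.eye(4)[2 * x + y]
infer = lambda x, w: max((0, 1), key=lambda y: float(w @ phi(x, y)))
w = train_structured([(0, 0), (1, 1)], phi, infer, dim=4)
print(infer(0, w), infer(1, w))  # -> 0 1
```

The real SSVM additionally enforces a margin scaled by the task loss; the update direction is the same.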
19 Multi-Task Structured Prediction Multi-task structured prediction (MTSP): from input x, learn one model per task with intra-task features, f_k : X → Y_k scored by f_k(x, y) = w_k · ϕ_k(x, y) for k = 1, 2, 3. How do we exploit the interdependencies between tasks?
20 Multi-Task Structured Prediction Introduce inter-task features ϕ_(1,2)(x, y, y'), ϕ_(1,3)(x, y, y'), ϕ_(2,3)(x, y, y') that relate pairs of task outputs, alongside the intra-task features ϕ_1, ϕ_2, ϕ_3.
21-24 Pipeline Architecture Learn k (= 3) models, one after another. Before start: fix an order, Task 1 → Task 2 → Task 3; the weights to learn are w_1, w_2, w_(1,2), w_3, w_(1,3), w_(2,3). Task 1: run the SSVM learner on x with features ϕ_1(x, y) to train w_1, then predict. Task 2: train on ϕ_2(x, y) and ϕ_(1,2)(x, y, y'), using Task 1's predictions, to get w_2 and w_(1,2), then predict. Task 3: train on ϕ_3(x, y), ϕ_(1,3)(x, y, y'), and ϕ_(2,3)(x, y, y'), using the predictions of Tasks 1 and 2, to get w_3, w_(1,3), and w_(2,3).
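The pipeline stages can be sketched as a loop where each task's learner sees, as extra features, the predictions of the previously trained tasks. The toy majority-vote learner below is a stand-in for the SSVM learner; all names and interfaces are ours:

```python
from collections import Counter, defaultdict

class MajorityLearner:
    """Toy stand-in for the SSVM learner: memorizes the majority label
    seen for each feature tuple at training time."""
    def fit(self, feats, labels):
        by_feat = defaultdict(Counter)
        for f, y in zip(feats, labels):
            by_feat[tuple(f)][y] += 1
        self.table = {f: c.most_common(1)[0][0] for f, c in by_feat.items()}
        return self

    def predict(self, f):
        return self.table.get(tuple(f), 0)

def train_pipeline(xs, gold_per_task, k):
    """Train k tasks in a fixed order; task t's features are the raw input
    plus the predictions of tasks 0..t-1 (the inter-task features)."""
    models, preds = [], []
    for t in range(k):
        feats = [[x] + [preds[s][i] for s in range(t)] for i, x in enumerate(xs)]
        m = MajorityLearner().fit(feats, gold_per_task[t])
        models.append(m)
        preds.append([m.predict(f) for f in feats])
    return models, preds

# toy data: task 0 echoes the input, task 1 flips it
models, preds = train_pipeline([0, 1, 0, 1], [[0, 1, 0, 1], [1, 0, 1, 0]], k=2)
print(preds[1])  # -> [1, 0, 1, 0]
```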
25 Pipeline Performance Depends on Task Order Each group of bars represents one task; within a group, the bars show the task's accuracy when it is placed first (1st bar) versus last (2nd and 3rd bars) in the pipeline order. Every task performs better when placed last, and no single ordering lets the pipeline reach peak performance on all three tasks.
26 Joint Architecture Tasks 1 & 2 & 3 together: one SSVM learner trains on all features — ϕ_1(x, y), ϕ_2(x, y), ϕ_3(x, y), ϕ_(1,2)(x, y, y'), ϕ_(1,3)(x, y, y'), ϕ_(2,3)(x, y, y') — learning w_1, w_2, w_3, w_(1,2), w_(1,3), w_(2,3) jointly, then predicts all outputs at once. The joint feature vector is the concatenation ϕ = ϕ_1 ⊕ ϕ_2 ⊕ ϕ_3 ⊕ ϕ_(1,2) ⊕ ϕ_(1,3) ⊕ ϕ_(2,3). Big problem: huge branching factor for search.
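The joint feature vector here is plain vector concatenation, e.g. with numpy (block sizes are illustrative, not the real feature dimensions):

```python
import numpy as np

def joint_features(intra, inter):
    """Concatenate the intra-task blocks phi_t and the inter-task blocks
    phi_(s,t) into one joint vector, scored by one joint weight vector."""
    return np.concatenate(list(intra) + list(inter))

phi1, phi2, phi3 = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0])
phi12, phi13, phi23 = np.array([3.0]), np.array([4.0]), np.array([5.0])
phi = joint_features([phi1, phi2, phi3], [phi12, phi13, phi23])
print(phi.tolist())  # -> [1.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```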
27 Pruning A pruner is a classifier that prunes the domain of each variable using state features. Score-agnostic pruning: train the pruner on the training data first, then train the SSVM cost function on the pruned data. It speeds up training, but may or may not improve test accuracy. Score-sensitive pruning: train the SSVM cost function first, then train the pruner against it. It can improve test accuracy; there is no training speedup, but evaluation does speed up.
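A pruner in this spirit can be sketched as a cheap per-candidate filter that shrinks each variable's domain before the more expensive structured search; the scoring function and the keep-top-k rule below are illustrative assumptions:

```python
def prune_domains(domains, pruner_score, keep=3):
    """For each variable, keep only the top-'keep' candidate values under a
    cheap pruner score, shrinking the space the structured search explores."""
    pruned = []
    for candidates in domains:
        ranked = sorted(candidates, key=pruner_score, reverse=True)
        pruned.append(ranked[:keep])
    return pruned

# toy pruner: prefer values near 3 (a stand-in for a learned classifier score)
pruned = prune_domains([list(range(10)), list(range(6))],
                       pruner_score=lambda v: -abs(v - 3))
print(pruned)  # -> [[3, 2, 4], [3, 2, 4]]
```

The branching factor of the search drops from the full domain size to `keep`, which is where the training/evaluation speedups on the slide come from.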
28 Cyclic Architecture The pipeline architecture runs Task 1 → Task 2 → Task 3. What if we connect the tail of the pipeline back to the head?
29-32 Cyclic Architecture Unshared-Weight Cyclic Training Step 1: fix an order, Task 1 → Task 2 → Task 3. Step 2: predict initial outputs for every task. Then the tasks take turns being retrained on x: Task 1 uses features ϕ_1(x, y), ϕ_(1,2)(x, y, y'), ϕ_(1,3)(x, y, y'); Task 2 uses ϕ_2(x, y), ϕ_(1,2)(x, y, y'), ϕ_(2,3)(x, y, y') (training w_2, w_(1,2), w_(2,3)); Task 3 uses ϕ_3(x, y), ϕ_(1,3)(x, y, y'), ϕ_(2,3)(x, y, y'). Each task keeps its own copy of the inter-task weights: the weights are independent across tasks.
33-36 Cyclic Architecture Shared-Weight Cyclic Training Same procedure: fix an order, Task 1 → Task 2 → Task 3; predict initial outputs; then give each task its turn — Task 1 uses ϕ_1(x, y), ϕ_(1,2)(x, y, y'), ϕ_(1,3)(x, y, y'); Task 2 uses ϕ_2(x, y), ϕ_(1,2)(x, y, y'), ϕ_(2,3)(x, y, y'); Task 3 uses ϕ_3(x, y), ϕ_(1,3)(x, y, y'), ϕ_(2,3)(x, y, y') — but now all tasks update one shared set of weights w_1, w_2, w_3, w_(1,2), w_(1,3), w_(2,3).
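The cyclic rounds can be sketched as repeated re-prediction, where each task's predictor sees the other tasks' latest outputs. This is a toy sketch: the talk trains SSVM models, while the predictors below are simple stand-in callables:

```python
def cyclic_predict(x, predictors, rounds=2):
    """predictors[t](x, others) -> output for task t given the current
    outputs of all tasks ('others'); None marks a not-yet-predicted task.
    First make an independent initial pass, then let the tasks take turns."""
    k = len(predictors)
    ys = [predictors[t](x, [None] * k) for t in range(k)]  # initial outputs
    for _ in range(rounds):           # tail of the pipeline feeds the head
        for t in range(k):
            ys[t] = predictors[t](x, list(ys))
    return ys

# toy tasks: task 0 echoes the input; task 1 depends on task 0's output,
# so it only becomes correct once the cycle feeds task 0's prediction back
predictors = [lambda x, o: x,
              lambda x, o: 0 if o[0] is None else o[0] + 1]
print(cyclic_predict(5, predictors))  # -> [5, 6]
```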
37 Experimental Setup Datasets: ACE2005 — Train/Dev/Test: 338/144/117 documents; entity-linking annotation: ACE-to-Wiki; knowledge base: Wikipedia (2015 dump). TAC-KBP2015 — Train/Dev/Test: 132/36/167 documents; knowledge base: Freebase (2014 dump). Evaluation: Coreference — MUC, BCube, CEAFe, and their average (CoNLL); NER — Hamming accuracy; Linking — Hamming accuracy; within-document coreference — CoNLL; cross-document coreference — CEAFm; NER & Linking — NERLC (combined accuracy of NER and Linking). All metrics are accuracies (larger is better).
38-39 Results Joint Architecture Performance [Results tables for the ACE05 and TAC15 test sets — comparing Berkeley (the rank-1st prior system), STSP, the joint architecture with random vs. good initialization, score-agnostic vs. score-sensitive pruning, and the cyclic variants, with training times — are not recoverable from this transcription.] Takeaways: 1. Joint-Good-Init > STSP: interdependency, captured by inter-task features, does benefit the system. 2. Joint-Good-Init > Joint-Rand-Init: search-based inference for large structured prediction problems suffers from local optima, which a good initialization mitigates. 3. Search-based MTSP is competitive with or better than the state-of-the-art system. 4. Score-sensitive pruning of joint MTSP performs best, and takes the most time.
40 Results Cyclic Architecture Performance [Results tables for the ACE05 and TAC15 test sets are likewise not recoverable from this transcription.] Takeaways: the cyclic architecture achieves competitive accuracy with much faster training, and unshared weights perform better than shared weights.
41 Summary 1. Search-based multi-task structured prediction outperforms prior work based on graphical models on all 3 entity analysis tasks. 2. Studied three learning and inference architectures: pipeline, cyclic, and joint, with trade-offs between accuracy and speed. 3. The joint architecture with score-sensitive pruning performs the best. 4. The cyclic architecture with unshared weights is competitive in accuracy and faster to train.
42 Thank You!
More informationSum-Product Networks: A New Deep Architecture
Sum-Product Networks: A New Deep Architecture Pedro Domingos Dept. Computer Science & Eng. University of Washington Joint work with Hoifung Poon 1 Graphical Models: Challenges Bayesian Network Markov Network
More informationMedical Question Answering for Clinical Decision Support
Medical Question Answering for Clinical Decision Support Travis R. Goodwin and Sanda M. Harabagiu The University of Texas at Dallas Human Language Technology Research Institute http://www.hlt.utdallas.edu
More informationCS395T: Structured Models for NLP Lecture 19: Advanced NNs I. Greg Durrett
CS395T: Structured Models for NLP Lecture 19: Advanced NNs I Greg Durrett Administrivia Kyunghyun Cho (NYU) talk Friday 11am GDC 6.302 Project 3 due today! Final project out today! Proposal due in 1 week
More informationBe able to define the following terms and answer basic questions about them:
CS440/ECE448 Section Q Fall 2017 Final Review Be able to define the following terms and answer basic questions about them: Probability o Random variables, axioms of probability o Joint, marginal, conditional
More informationMulticlass Boosting with Repartitioning
Multiclass Boosting with Repartitioning Ling Li Learning Systems Group, Caltech ICML 2006 Binary and Multiclass Problems Binary classification problems Y = { 1, 1} Multiclass classification problems Y
More informationarxiv: v3 [cs.cl] 22 Jun 2017
Optimizing Differentiable Relaxations of Coreference Evaluation Metrics Phong Le 1 Ivan Titov 1,2 1 ILLC, University of Amsterdam 2 ILCC, School of Informatics, University of Edinburgh p.le@uva.nl ititov@inf.ed.ac.uk
More informationArtificial Neural Networks. Introduction to Computational Neuroscience Tambet Matiisen
Artificial Neural Networks Introduction to Computational Neuroscience Tambet Matiisen 2.04.2018 Artificial neural network NB! Inspired by biology, not based on biology! Applications Automatic speech recognition
More informationMachine Learning (CS 567) Lecture 2
Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More informationErrors, and What to Do. CS 188: Artificial Intelligence Fall What to Do About Errors. Later On. Some (Simplified) Biology
CS 188: Artificial Intelligence Fall 2011 Lecture 22: Perceptrons and More! 11/15/2011 Dan Klein UC Berkeley Errors, and What to Do Examples of errors Dear GlobalSCAPE Customer, GlobalSCAPE has partnered
More informationNeural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17
3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural
More informationCS 188: Artificial Intelligence Fall 2011
CS 188: Artificial Intelligence Fall 2011 Lecture 22: Perceptrons and More! 11/15/2011 Dan Klein UC Berkeley Errors, and What to Do Examples of errors Dear GlobalSCAPE Customer, GlobalSCAPE has partnered
More informationComputational Learning Theory
Computational Learning Theory Pardis Noorzad Department of Computer Engineering and IT Amirkabir University of Technology Ordibehesht 1390 Introduction For the analysis of data structures and algorithms
More informationAPPLIED DEEP LEARNING PROF ALEXIEI DINGLI
APPLIED DEEP LEARNING PROF ALEXIEI DINGLI TECH NEWS TECH NEWS HOW TO DO IT? TECH NEWS APPLICATIONS TECH NEWS TECH NEWS NEURAL NETWORKS Interconnected set of nodes and edges Designed to perform complex
More informationOnline Passive-Aggressive Algorithms. Tirgul 11
Online Passive-Aggressive Algorithms Tirgul 11 Multi-Label Classification 2 Multilabel Problem: Example Mapping Apps to smart folders: Assign an installed app to one or more folders Candy Crush Saga 3
More informationQualifier: CS 6375 Machine Learning Spring 2015
Qualifier: CS 6375 Machine Learning Spring 2015 The exam is closed book. You are allowed to use two double-sided cheat sheets and a calculator. If you run out of room for an answer, use an additional sheet
More informationCHAPTER 6 CONCLUSION AND FUTURE SCOPE
CHAPTER 6 CONCLUSION AND FUTURE SCOPE 146 CHAPTER 6 CONCLUSION AND FUTURE SCOPE 6.1 SUMMARY The first chapter of the thesis highlighted the need of accurate wind forecasting models in order to transform
More informationConditional Random Field
Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions
More informationLecture 15. Probabilistic Models on Graph
Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how
More informationCutting Plane Training of Structural SVM
Cutting Plane Training of Structural SVM Seth Neel University of Pennsylvania sethneel@wharton.upenn.edu September 28, 2017 Seth Neel (Penn) Short title September 28, 2017 1 / 33 Overview Structural SVMs
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest
More informationCORE: Context-Aware Open Relation Extraction with Factorization Machines. Fabio Petroni
CORE: Context-Aware Open Relation Extraction with Factorization Machines Fabio Petroni Luciano Del Corro Rainer Gemulla Open relation extraction Open relation extraction is the task of extracting new facts
More informationFace detection and recognition. Detection Recognition Sally
Face detection and recognition Detection Recognition Sally Face detection & recognition Viola & Jones detector Available in open CV Face recognition Eigenfaces for face recognition Metric learning identification
More informationLearning From Data Lecture 2 The Perceptron
Learning From Data Lecture 2 The Perceptron The Learning Setup A Simple Learning Algorithm: PLA Other Views of Learning Is Learning Feasible: A Puzzle M. Magdon-Ismail CSCI 4100/6100 recap: The Plan 1.
More informationNon-Linearity. CS 188: Artificial Intelligence. Non-Linear Separators. Non-Linear Separators. Deep Learning I
Non-Linearity CS 188: Artificial Intelligence Deep Learning I Instructors: Pieter Abbeel & Anca Dragan --- University of California, Berkeley [These slides were created by Dan Klein, Pieter Abbeel, Anca
More informationWhat is Optimization? Optimization is the process of minimizing or maximizing a function subject to several constraints on its variables
Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø Ø What is Optimization? Optimization is the process of minimizing or maximizing a function subject to several constraints on its variables Optimization In our every day
More informationNamed Entity Recognition using Maximum Entropy Model SEEM5680
Named Entity Recognition using Maximum Entroy Model SEEM5680 Named Entity Recognition System Named Entity Recognition (NER): Identifying certain hrases/word sequences in a free text. Generally it involves
More informationSupport Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification
More informationA Linear Programming Formulation for Global Inference in Natural Language Tasks
A Linear Programming Formulation for Global Inference in Natural Language Tasks Dan Roth Wen-tau Yih Department of Computer Science University of Illinois at Urbana-Champaign {danr, yih}@uiuc.edu Abstract
More informationFoundations of Machine Learning
Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about
More informationAutomatic Feature Decomposition for Single View Co-training
Automatic Feature Decomposition for Single View Co-training Minmin Chen, Kilian Weinberger, Yixin Chen Computer Science and Engineering Washington University in Saint Louis Minmin Chen, Kilian Weinberger,
More informationGlobal Machine Learning for Spatial Ontology Population
Global Machine Learning for Spatial Ontology Population Parisa Kordjamshidi, Marie-Francine Moens KU Leuven, Belgium Abstract Understanding spatial language is important in many applications such as geographical
More information2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media,
2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising
More informationOnline Passive- Aggressive Algorithms
Online Passive- Aggressive Algorithms Jean-Baptiste Behuet 28/11/2007 Tutor: Eneldo Loza Mencía Seminar aus Maschinellem Lernen WS07/08 Overview Online algorithms Online Binary Classification Problem Perceptron
More information