Discriminative Training
Transcription
1 Discriminative Training February 19, 2013
2 Noisy Channels Again p(e): a source model over English.
3 Noisy Channels Again p(e), p(g|e): source English, channel English → German.
4–8 Noisy Channels Again The decoder computes
e* = argmax_e p(e|g)
   = argmax_e p(g|e) p(e) / p(g)
   = argmax_e p(g|e) p(e)
   = argmax_e log p(g|e) + log p(e)
   = argmax_e [1, 1]^T [log p(g|e), log p(e)]^T
   = argmax_e w^T h(g, e)
Does this look familiar?
9 The Noisy Channel [figure: translation hypotheses plotted in a plane with axes −log p(g|e) and −log p(e)]
10–17 As a Linear Model [figure: the same points with a weight vector w; the decoder returns the hypothesis maximizing w^T h]
Improvement 1: change w to find better translations g.
Improvement 2: add dimensions to make points separable.
18 Linear Models e* = argmax_e w^T h(g, e). Improve the modeling capacity of the noisy channel in two ways: reorient the weight vector; add new dimensions (new features). Questions: What features h(g, e)? How do we set the weights w?
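To make the linear-model view concrete, here is a minimal sketch (not from the slides; the feature values are invented) of decoding as argmax_e w^T h(g, e). With w = (1, 1) over (log p(g|e), log p(e)) it reduces to the noisy channel; reorienting w is Improvement 1.

```python
# Minimal sketch of decoding as a linear model; feature values are made up.

def score(w, h):
    """Dot product w . h(g, e)."""
    return sum(wi * hi for wi, hi in zip(w, h))

# h(g, e) = (log p(g|e), log p(e)) for three hypothetical candidates
candidates = {
    "man bites dog": (-2.1, -3.0),
    "dog bites man": (-2.3, -2.9),
    "man bite dog":  (-2.0, -5.5),
}

w = (1.0, 1.0)  # equal weights recover the noisy channel
best = max(candidates, key=lambda e: score(w, candidates[e]))
print(best)  # "man bites dog": the argmax_e of w . h(g, e)
```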
19 Mann beißt Hund (German: "man bites dog")
20 Mann beißt Hund, x BITES y
21–25 Candidate translations of Mann beißt Hund: man bites cat; dog man chase; man bite cat; man bite dog; dog bites man; man bites dog.
26–28 Feature Classes
Lexical: Are lexical choices appropriate? (bank = river bank vs. financial institution)
Configurational: Are semantic/syntactic relations preserved? (dog bites man vs. man bites dog)
Grammatical: Is the output fluent / well-formed? (man bites dog vs. man bite dog)
29–32 What do lexical features look like? Mann beißt Hund → man bites cat.
First attempt: score(g, e) = w^T h(g, e), with features such as
h_15342(g, e) = 1 if ∃ i, j : g_i = Hund, e_j = cat; 0 otherwise.
But what if a cat is being chased by a Hund?
33 Latent variables enable more precise features: score(g, e, a) = w^T h(g, e, a), with
h_15342(g, e, a) = Σ_{(i,j) ∈ a} [1 if g_i = Hund, e_j = cat; 0 otherwise]
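A hedged sketch of the two feature styles above; the sentence pair, the alignment, and the feature index 15342 are purely illustrative.

```python
# Sketch of the lexical features above; data and index are illustrative.

def h_sentence_pair(g, e):
    """First attempt: fires if Hund and cat co-occur anywhere in the pair."""
    return 1.0 if "Hund" in g and "cat" in e else 0.0

def h_aligned(g, e, a):
    """Latent-variable version: counts alignment links (i, j) with
    g[i] = Hund and e[j] = cat, so an unaligned cat does not fire it."""
    return float(sum(1 for (i, j) in a if g[i] == "Hund" and e[j] == "cat"))

g = ["Mann", "beißt", "Hund"]
e = ["man", "bites", "cat"]
a = [(0, 0), (1, 1), (2, 2)]  # word alignment as (source, target) index pairs
print(h_sentence_pair(g, e), h_aligned(g, e, a))  # 1.0 1.0
```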
34 Standard Features
Target-side features: log p(e) [n-gram language model]; number of words in hypothesis; non-English character count.
Source + target features: log relative frequency e|f of each rule [log #(e,f) − log #(f)]; log relative frequency f|e of each rule [log #(e,f) − log #(e)]; lexical translation log probability e|f of each rule [log p_Model1(e|f)]; lexical translation log probability f|e of each rule [log p_Model1(f|e)].
Other features: count of rules/phrases used; reordering pattern probabilities.
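As a sketch of how the relative-frequency rule features might be computed; the counts here are invented, whereas a real system reads them off rule extraction.

```python
# Sketch of relative-frequency rule features; counts are invented.
import math

count_ef = {("dog", "Hund"): 9000, ("hound", "Hund"): 1000}
count_f = {"Hund": 10000}
count_e = {"dog": 50000, "hound": 1200}

def log_rel_freq_e_given_f(e, f):
    # log #(e,f) - log #(f)
    return math.log(count_ef[(e, f)]) - math.log(count_f[f])

def log_rel_freq_f_given_e(e, f):
    # log #(e,f) - log #(e)
    return math.log(count_ef[(e, f)]) - math.log(count_e[e])

print(log_rel_freq_e_given_f("dog", "Hund"))  # log 0.9
print(log_rel_freq_f_given_e("dog", "Hund"))  # log 0.18
```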
35 Parameter Learning
36–38 Hypothesis Space [figure: hypotheses and reference translations plotted in feature space with axes h_1 and h_2]
39 Preliminaries We assume a decoder that computes
⟨e*, a*⟩ = argmax over ⟨e,a⟩ of w^T h(g, e, a)
and K-best lists thereof, that is
{⟨e_i, a_i⟩}_{i=1}^{K} = arg i-th max over ⟨e,a⟩ of w^T h(g, e, a)
Standard, efficient algorithms exist for this.
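A brute-force stand-in for that decoder interface (the candidate list is hypothetical; real decoders compute k-best lists with efficient dynamic-programming algorithms over the search graph):

```python
# Brute-force k-best over an explicit candidate list, as a stand-in for the
# assumed decoder interface.
import heapq

def kbest(candidates, w, k):
    """candidates: list of (e, a, h) with h the feature vector h(g, e, a).
    Returns the k highest-scoring (score, e, a) triples."""
    scored = ((sum(wi * hi for wi, hi in zip(w, h)), e, a)
              for (e, a, h) in candidates)
    return heapq.nlargest(k, scored)
```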
40–41 Learning Weights Try to match the reference translation exactly.
Conditional random field: maximize the conditional probability of the reference translations; average over the different latent variables.
Max-margin: find the weight vector that separates the reference translation from others by the maximal margin; maximal setting of the latent variables.
42–43 Problems These methods give full credit when the model exactly produces the reference, and no credit otherwise. What is the problem with this? There are many ways to translate a sentence. What if we have multiple reference translations? What about partial credit?
44 Cost-Sensitive Training Assume we have a cost function that scores how good/bad a translation is: Δ(ê, E) ∈ [0, 1]. Optimize the weight vector by making reference to this function. We will talk about two ways to do this.
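One plausible instantiation of such a cost function, sketched here with toy unigram overlap standing in for a real sentence-level metric such as 1 − BLEU:

```python
# Toy cost Δ(hyp, refs) in [0, 1]: 1 minus the best unigram overlap with any
# reference. A real system would use a smoothed sentence-level BLEU instead.

def cost(hyp, refs):
    hyp_words = hyp.split()
    if not hyp_words:
        return 1.0
    best_overlap = max(
        sum(1 for w in hyp_words if w in set(ref.split())) / len(hyp_words)
        for ref in refs
    )
    return 1.0 - best_overlap  # 0 = perfect match, 1 = no overlap

print(cost("man bites dog", ["man bites dog"]))  # 0.0
print(cost("dog man chase", ["man bites dog"]))  # ~0.33
```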
45–47 K-Best List Example [figure: the k-best hypotheses #1–#10 plotted in feature space with weight vector w, grouped into bands by their cost Δ, from low-cost hypotheses at the top of the list to high-cost ones at the bottom]
48 Training as Classification Pairwise Ranking Optimization: reduce the training problem to binary classification with a linear model.
Algorithm: for i = 1 to N: pick a random pair of hypotheses (A, B) from the k-best list; use the cost function to determine whether A or B is better; create the i-th training instance. Then train a binary linear classifier, as in the sketch below.
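A minimal sketch of PRO as described above. The sample size N, the perceptron-style updates, and the data layout are illustrative choices; a real implementation typically also discards pairs whose costs are too close and uses an off-the-shelf binary classifier.

```python
# Sketch of Pairwise Ranking Optimization (PRO); details are illustrative.
import random

def pro_train(kbest, N=5000, epochs=10, lr=0.1):
    """kbest: list of (h, delta) pairs, h a feature tuple, delta its cost.
    Returns a weight vector that scores low-cost hypotheses higher."""
    dim = len(kbest[0][0])
    examples = []
    for _ in range(N):
        (ha, da), (hb, db) = random.sample(kbest, 2)
        if da == db:
            continue  # no preference, skip the pair
        # Label +1 if A is better (lower cost), -1 otherwise; the feature
        # vector of the pair is the difference h(A) - h(B).
        y = 1.0 if da < db else -1.0
        x = tuple(a - b for a, b in zip(ha, hb))
        examples.append((x, y))
    w = [0.0] * dim
    for _ in range(epochs):  # simple perceptron-style training
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w
```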
49–58 [figures: sampled hypothesis pairs from the k-best list, each labeled Better! or Worse! according to the cost function, producing labeled difference vectors in (h_1, h_2) space]
59–61 [figures: fit a linear model w to the labeled points; under the new weight vector the k-best list is re-ranked so that low-cost hypotheses score highest]
62 MERT Minimum Error Rate Training: directly target an automatic evaluation metric. BLEU is defined at the corpus level, and MERT optimizes at the corpus level. Downside: does not deal well with more than ~20 features.
63–68 MERT Given a weight vector w, any hypothesis ⟨e, a⟩ will have a (scalar) score
m = w^T h(g, e, a)
Now pick a search vector v and consider how the score of this hypothesis changes as we take a step γ along it, w_new = w + γv:
m = (w + γv)^T h(g, e, a) = w^T h(g, e, a) + γ · v^T h(g, e, a)
Writing b = w^T h(g, e, a) and a = v^T h(g, e, a), this is m = aγ + b: a linear function in 2D!
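A sketch of that observation in code: along the search direction v, each hypothesis's score is a line m(γ) = aγ + b, so the decoder's winner at any step size is the maximum over lines. The weight, direction, and feature values are illustrative.

```python
# Each k-best hypothesis becomes a line in the (gamma, score) plane.

def line(w, v, h):
    """Return (a, b) with m(gamma) = a * gamma + b for features h."""
    b = sum(wi * hi for wi, hi in zip(w, h))
    a = sum(vi * hi for vi, hi in zip(v, h))
    return a, b

def winner_at(gamma, lines):
    """Index of the hypothesis the decoder would pick at step size gamma."""
    return max(range(len(lines)),
               key=lambda i: lines[i][0] * gamma + lines[i][1])

w, v = (1.0, 1.0), (1.0, -1.0)
hyps = [(-2.1, -3.0), (-2.3, -2.9), (-2.0, -5.5)]  # feature vectors
ls = [line(w, v, h) for h in hyps]
print([winner_at(g, ls) for g in (-1.0, 0.0, 1.0)])  # [1, 0, 2]
```

Real MERT computes the exact upper envelope of these lines, i.e. the γ values where the winning hypothesis changes, instead of probing fixed γ values.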
69–79 MERT [figures: recall our k-best set {⟨e_i, a_i⟩}_{i=1}^{K}; each hypothesis, e.g. ⟨e_28, a_28⟩, ⟨e_73, a_73⟩, ⟨e_162, a_162⟩, is a line in the (γ, m) plane; the upper envelope of these lines determines which hypothesis the decoder picks in each interval of γ, and plotting errors over the same intervals gives a piecewise-constant error function; pick the γ that minimizes errors and let w_new = w + γv]
80 MERT In practice, errors are sufficient statistics for evaluation metrics (e.g., BLEU), so we can maximize or minimize! The envelope can also be computed using dynamic programming, with interesting complexity bounds. How do you pick the search direction?
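A sketch of the sufficient-statistics point: if each hypothesis carries additive statistics (n-gram matches, hypothesis length, reference length), corpus-level BLEU across the envelope's intervals can be updated by adding and subtracting per-sentence statistics instead of re-scoring the whole corpus. This toy tracks unigram statistics only.

```python
# Additive sufficient statistics for a toy unigram-only BLEU.
from collections import Counter

def bleu1_stats(hyp, ref):
    """Toy statistics: (unigram matches, hyp length, ref length)."""
    h, r = hyp.split(), ref.split()
    matches = sum((Counter(h) & Counter(r)).values())
    return (matches, len(h), len(r))

def add(s1, s2):
    return tuple(a + b for a, b in zip(s1, s2))

# Summing per-sentence statistics gives corpus-level statistics, so swapping
# one sentence's winning hypothesis changes the corpus totals additively.
corpus = add(bleu1_stats("man bites dog", "man bites dog"),
             bleu1_stats("dog man chase", "man bites dog"))
print(corpus)  # (total matches, total hyp length, total ref length)
```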
81 Summary Evaluation metrics: figure out how well we're doing; figure out if a feature helps; but ALSO: train your system! What's a great way to improve translation? Improve evaluation!
82 Thank You!