Learning to translate with neural networks. Michael Auli
2-4 Neural networks for text processing: similar words lie near each other (dog/cat, France/Spain). Changing the model parameters for one example affects similar words in similar contexts. Traditional discrete models treat each word separately.
5-6 Neural networks have driven sharp error reductions across fields. Vision: error 26% to 15% (Krizhevsky 2012). Speech: error 27% to 18% (Hinton 2012). Language modeling: perplexity reductions (Mikolov 2011). Machine translation: this talk; Le (2012), Kalchbrenner (2013), Devlin (2014), Sutskever (2014), Cho (2014).
7-9 What happened in MT over the past 10 years? Learning simple models from large bi-texts is a solved problem! (Lopez & Post, 2013) [figure: WMT data over the past 10 years]
10-15 Machine translation, running example: "development and progress of the region." Components covered in this talk: translation modeling (Auli et al., EMNLP 2013; Hu et al., EACL 2014); language modeling; optimization (Auli & Gao, ACL 2014); reordering (Auli et al., EMNLP 2014).
16-22 Discrete phrase-based translation (Koehn et al., 2003): the example "development and progress of the region." is translated phrase by phrase, with phrases reordered ("of the region" before "development and progress" on the source side). [figure: histogram of phrase lengths used in training vs. at test time, 0% to 70%] Average phrase length in train: 2.3 words; in test: 1.5 words.
23-26 Discrete n-gram language modeling (Kneser & Ney, 1996). A bigram model decomposes p(progress in the region) = p(progress) p(in | progress) p(the | in) p(region | the). Train data: "development and progress of the region." [figure: histogram of n-gram orders matched at test time, 0% to 40%; does not include out-of-vocabulary tokens] Average match length: 2.7 words.
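As a minimal illustration of the chain-rule decomposition above, here is a count-based bigram model in Python. The one-sentence corpus and the use of unsmoothed maximum-likelihood estimates (rather than Kneser-Ney) are simplifications for the sketch.

```python
from collections import Counter

# Toy corpus standing in for the training sentence on the slide.
corpus = "<s> development and progress of the region </s>".split()

# Maximum-likelihood bigram estimates: p(w | v) = c(v, w) / c(v).
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, v):
    """MLE estimate of p(w | v); 0.0 for unseen bigrams (no smoothing)."""
    return bigrams[(v, w)] / unigrams[v] if unigrams[v] else 0.0

def sentence_prob(words):
    """Chain-rule product p(w1) p(w2 | w1) ... p(wn | wn-1)."""
    prob = unigrams[words[0]] / len(corpus)
    for v, w in zip(words, words[1:]):
        prob *= p_bigram(w, v)
    return prob

# Every bigram of "progress of the region" occurs in the training sentence,
# so only the initial unigram term is below 1: 1/8 = 0.125.
print(sentence_prob("progress of the region".split()))  # → 0.125
```

An unseen bigram drives the whole product to zero, which is exactly the sparsity problem that smoothing (and, later in the talk, neural models) addresses.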
27 How can we improve this? Or: how to capture relationships beyond 1.5 to 2.7 words? Neural nets: distributional representations make it easier to capture relationships (dog/cat, France/Spain). Recurrent nets: easy to model variable-length sequences.
28-29 This talk: translation modeling (Auli et al., EMNLP 2013; Hu et al., EACL 2014); language modeling; optimization (Auli & Gao, ACL 2014); reordering (Auli et al., EMNLP 2014). Running example: "development and progress of the region."
30-36 Feed-forward network, e.g. a bigram LM predicting p(progress | and). Feature learning: h_t = σ(U x_t), with σ(z) = 1 / (1 + exp(−z)). Classification: y_t = s(V h_t), with softmax s(z)_m = exp(z_m) / Σ_{m'} exp(z_{m'}). Still based on limited context!
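The two layers can be sketched in a few lines of Python; the vocabulary size, hidden size, and random weights are placeholders, not the talk's actual configuration.

```python
import math
import random

random.seed(0)
V_SIZE, H_SIZE = 5, 3   # toy vocabulary and hidden sizes (assumed, not from the talk)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

# Random weights stand in for trained parameters.
U = [[random.gauss(0, 0.1) for _ in range(V_SIZE)] for _ in range(H_SIZE)]
V = [[random.gauss(0, 0.1) for _ in range(H_SIZE)] for _ in range(V_SIZE)]

def forward(word_id):
    """One step of the feed-forward bigram LM on the slide:
    feature learning h_t = sigmoid(U x_t), classification y_t = softmax(V h_t)."""
    x = [1.0 if i == word_id else 0.0 for i in range(V_SIZE)]  # one-hot input word
    h = [sigmoid(a) for a in matvec(U, x)]
    return softmax(matvec(V, h))      # distribution over the next word

y = forward(2)                        # e.g. p(. | "and") if index 2 stood for "and"
```

Whatever the weights, the output is a valid probability distribution over the vocabulary; only the single previous word reaches the hidden layer, which is the "limited context" the next slides fix.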
37-40 Recurrent network, e.g. for p(progress | and): add a dependence on the previous time step, h_t = σ(U x_t + W h_{t−1}), with the same U, W, V applied at every step.
41-51 Unrolling the recurrent network over "<s> development and progress": the model predicts p(development | <s>), then p(and | development, <s>), then p(progress | development, and, <s>), with the hidden state carrying the history of inputs up to the current time step. State of the art in language modeling (Mikolov 2011); more accurate than feed-forward nets (Sundermeyer 2013).
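The unrolled recurrence can be sketched directly; the weights here are random (untrained) and the sizes are assumptions, so only the shape of the computation is meaningful.

```python
import math
import random

random.seed(1)
vocab = ["<s>", "development", "and", "progress"]
H = 3                                 # toy hidden size (assumed)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

U, W, V = rand_matrix(H, len(vocab)), rand_matrix(H, H), rand_matrix(len(vocab), H)

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_logprob(words):
    """log p(w_1 ... w_n) with h_t = sigmoid(U x_t + W h_{t-1}): the recurrent
    state summarizes the whole history of inputs, starting from <s>."""
    h, logp, prev = [0.0] * H, 0.0, "<s>"
    for word in words:
        x = [1.0 if v == prev else 0.0 for v in vocab]
        a = [u + w for u, w in zip(matvec(U, x), matvec(W, h))]
        h = [1.0 / (1.0 + math.exp(-z)) for z in a]
        y = softmax(matvec(V, h))     # p(next word | history so far)
        logp += math.log(y[vocab.index(word)])
        prev = word
    return logp

score = rnn_logprob(["development", "and", "progress"])
```

Note that, unlike the feed-forward net, no fixed context window appears anywhere: each step's prediction depends on the full prefix through h.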
52-61 Combined language and translation model (Auli et al., EMNLP 2013): condition each decoder step on a representation s of the entire source sentence, h_t = σ(U x_t + W h_{t−1} + F s), unrolled over "<s> development and progress" as before.
62-73 A variant replaces the full-sentence representation with a source word-window: s is built from the source words around the position aligned to the current target word, and h_t = σ(U x_t + W h_{t−1} + F s) as before. Related work: Le (2012), Kalchbrenner (2013), Devlin (2014), Cho (2014), Sutskever (2014).
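One way to sketch the source-conditioned step: the French words, embedding sizes, and the embedding average used to build s are illustrative assumptions, since the slides specify only the recurrence itself.

```python
import math
import random

random.seed(2)
H, E = 3, 4                           # toy hidden and embedding sizes (assumed)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical French source words and target history words with random embeddings.
src_emb = {w: [random.gauss(0, 0.1) for _ in range(E)]
           for w in ["développement", "et", "progrès"]}
tgt_emb = {w: [random.gauss(0, 0.1) for _ in range(E)]
           for w in ["<s>", "development", "and", "progress"]}

U, W, F = rand_matrix(H, E), rand_matrix(H, H), rand_matrix(H, E)

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

# s: a fixed representation of the source sentence. The slides leave its form
# open (full sentence vs. word window); averaging embeddings is one simple stand-in.
s = [sum(vec[i] for vec in src_emb.values()) / len(src_emb) for i in range(E)]

def step(h_prev, target_word):
    """One decoder step of the combined model:
    h_t = sigmoid(U x_t + W h_{t-1} + F s), conditioned on the source via F s."""
    x = tgt_emb[target_word]
    a = [u + w + f for u, w, f in
         zip(matvec(U, x), matvec(W, h_prev), matvec(F, s))]
    return [1.0 / (1.0 + math.exp(-z)) for z in a]

h = step([0.0] * H, "<s>")            # state after <s>, before predicting "development"
h = step(h, "development")
```

Because F s is added at every step, the source sentence influences every target-word prediction, turning the language model into a translation model.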
74 Experimental setup: WMT 2012 French-English translation task; data: 100M words; baseline: phrase-based model similar to Moses; rescoring; mini-batch gradient descent; class-structured output layer (Goodman, 1996).
75-77 Does the neural model learn to translate? BLEU comparison of the phrase-based baseline (Koehn et al., 2003), RNNLM (Mikolov 2011), RNNTM with full-sentence representation, and RNNTM with word-window. [figure: BLEU bar chart] WMT 2012 French-English, 100M words, phrase-based baseline, n-best rescoring.
78-79 Improving a phrase-based baseline: the same comparison with lattice rescoring. [figure: BLEU bar chart for phrase-based, RNNLM, RNNTM word-window] WMT 2012 French-English, 100M words, phrase-based baseline, lattice rescoring.
80-82 Combining recurrent nets with discrete models (Auli & Gao, ACL 2014). [figure: BLEU for phrase-based baseline vs. n-best rescoring vs. lattice rescoring vs. decoder integration]
83-87 Neural nets vs. discrete models: are neural models more robust where discrete models rely on sparse estimates? Test in language modeling: use the n-gram LM and the neural net LM as two components of the log-linear translation model, ê = argmax_e Σ_i λ_i h_i(f, e), with h_1(f, e) = log p_lm(e) and h_2(f, e) = log p_rnn(e). Then split each model into five features, one per n-gram order, s.t. log p_lm(e) = Σ_{i=1}^{5} h_i(f, e) and log p_rnn(e) = Σ_{i=6}^{10} h_i(f, e), and use a standard optimizer (MERT) to find a weight for each n-gram order.
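The per-order feature split above amounts to a weighted sum. In this sketch all numbers (per-order log-probabilities and MERT-style weights) are invented for illustration; only the scoring scheme reflects the slide.

```python
# Hypothetical per-n-gram-order log-probabilities for two candidate translations:
# features h_1..h_5 come from the n-gram LM, h_6..h_10 from the RNN LM, matching
# the split log p_lm(e) = sum_{i=1..5} h_i and log p_rnn(e) = sum_{i=6..10} h_i.
candidates = {
    "development and progress of the region": [-1.2, -0.8, -0.5, -0.4, -0.3,
                                               -1.0, -0.7, -0.6, -0.5, -0.4],
    "advance and progress of region":         [-1.5, -1.1, -0.9, -0.8, -0.7,
                                               -1.1, -0.9, -0.8, -0.7, -0.6],
}

# One weight lambda_i per n-gram order and model, as a MERT-style optimizer
# would tune them (values made up for illustration).
weights = [0.02, 0.03, 0.04, 0.05, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01]

def score(features):
    """Log-linear score sum_i lambda_i * h_i(f, e)."""
    return sum(l * h for l, h in zip(weights, features))

best = max(candidates, key=lambda e: score(candidates[e]))
```

Splitting by order lets the optimizer decide, per order, whether to trust the discrete counts or the neural estimate, which is what the weight plot on the next slides visualizes.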
88-94 Feature weights per n-gram order (optimizer: MERT). [figure: weight of the n-gram vs. RNN feature at each order, from more to less important] The RNN features receive more weight where the n-gram model must fall back on sparse unigram counts, and less where the n-gram model has reliable higher-order statistics.
95-98 Recurrent Minimum Translation Unit (MTU) models: the aligned sentence pair for "development and progress of the region" is decomposed into minimal bilingual units M1 ... M6 (e.g. units M1-M3 covering "the region of" on the source side). n-gram models over MTUs: p(M1) p(M2 | M1) p(M3 | M1, M2) (Banchs et al., 2005; Quirk & Menezes, 2006).
99-105 Recurrent MTU models (Hu et al., EACL 2014): predict the MTU sequence with an RNN, unrolled over "<s> development and progress". Problem: many MTUs are singletons.
106-116 Reduce sparsity by representing each MTU as a bag of its words rather than as an atomic symbol, again unrolled over "<s> development and progress".
117-119 Recurrent MTU results. [figure: BLEU for the phrase-based baseline (Koehn et al., 2003), the n-gram MTU model (Banchs et al., 2005; Quirk & Menezes, 2006), and the RNN MTU model]
120 This talk: translation modeling (Auli et al., EMNLP 2013; Hu et al., EACL 2014); language modeling; optimization (Auli & Gao, ACL 2014); reordering (Auli et al., EMNLP 2014). Running example: "development and progress of the region."
121-126 Back-propagation with cross-entropy error: given the true distribution t and the model distribution y, minimize J = −Σ_i t_i log y_i. Goal: make the correct outputs most likely. With a softmax output layer, the error vector at the output is δ_i = y_i − t_i.
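The loss and its output-layer error can be checked numerically; the logits and the one-hot target below are toy values.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

# Toy output-layer activations and a one-hot target (assumed values).
logits = [2.0, 1.0, 0.1]
t = [0.0, 1.0, 0.0]                   # true distribution: correct word is index 1

y = softmax(logits)                   # model distribution
J = -sum(ti * math.log(yi) for ti, yi in zip(t, y))   # J = -sum_i t_i log y_i

# For softmax combined with cross-entropy, the error at the output layer
# simplifies to delta_i = y_i - t_i, which is what gets back-propagated.
delta = [yi - ti for yi, ti in zip(y, t)]
```

The error on the correct output is negative (its probability must rise) and the errors sum to zero, since both y and t sum to one.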
127-130 Back-propagation through time: unroll the network over "<s> development and" and propagate the error back through the recurrent connections across time steps.
131-132 Optimization (Auli & Gao, ACL 2014): likelihood training is very common; optimizing for evaluation metrics directly is difficult but empirically successful (Och 2003, Smith 2006, Chiang 2009, Gimpel 2010, Hopkins 2011). Next: task-specific training of neural nets for translation.
133-138 BLEU metric (Bilingual Evaluation Understudy; Papineni 2002): n-gram precision scores combined with a brevity penalty. Example. Human: "development and progress of the region". System: "advance and progress of region".
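An unsmoothed sentence-level BLEU for the example pair can be computed directly; real sBLEU implementations smooth zero n-gram counts, which this bare sketch deliberately does not.

```python
import math
from collections import Counter

reference = "development and progress of the region".split()
hypothesis = "advance and progress of region".split()

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, ref, n):
    """Clipped n-gram precision: hypothesis n-grams are credited at most
    as often as they appear in the reference."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

def sentence_bleu(hyp, ref, max_n=4):
    """Geometric mean of the 1..max_n precisions times the brevity penalty.
    Without smoothing, any zero precision makes the whole score zero."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# The example hypothesis matches no 4-gram of the reference, so unsmoothed
# BLEU-4 is 0 here; up to trigrams it gets partial credit.
bleu4 = sentence_bleu(hypothesis, reference)
bleu3 = sentence_bleu(hypothesis, reference, max_n=3)
```

This zero-at-4-grams behavior is exactly why sentence-level BLEU variants used for training (sBLEU on the next slides) apply smoothing.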
139-141 Expected BLEU training (Smith 2006, He 2012, Gao 2014). Likelihood: L: max p(ẽ | f; θ), where ẽ is the desired translation output. Expected BLEU: xBLEU: max Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ), where E(f) is the set of generated outputs (e.g. an n-best list or lattice), sBLEU is the gain function, and p(e | f; θ) the model probability.
142-152 Expected BLEU training, example. Human: "development and progress of the region". Candidates: "advance and progress of the region", "development and progress of this province", "progress of this region". Each candidate e gets a sentence-level gain sBLEU(e, ẽ) and a model probability p_t(e | f; θ); the gradient ∂xBLEU/∂θ shifts probability mass toward high-sBLEU candidates, giving p_{t+1}(e | f; θ), where xBLEU = Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ).
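The expectation can be sketched over a toy n-best list. The candidate strings follow the slide's example, but the sBLEU values and model scores are invented, and the softmax over scores is just one simple way to form p(e | f; θ).

```python
import math

# Hypothetical n-best list: (candidate, sBLEU against the reference
# "development and progress of the region", raw model score).
nbest = [
    ("advance and progress of the region",        0.62, -2.1),
    ("development and progress of this province", 0.55, -2.4),
    ("progress of this region",                   0.30, -3.0),
]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expected_bleu(candidates):
    """xBLEU = sum_e sBLEU(e, ref) * p(e | f; theta), with p given by a
    softmax over the model scores of the generated outputs."""
    sbleu = [b for _, b, _ in candidates]
    p = softmax([s for _, _, s in candidates])
    return sum(b * pi for b, pi in zip(sbleu, p))

x = expected_bleu(nbest)

# Raising the model score of the highest-sBLEU candidate raises xBLEU:
# this is the direction the gradient pushes the parameters.
boosted = [(e, b, s + (0.5 if e == nbest[0][0] else 0.0)) for e, b, s in nbest]
x_boosted = expected_bleu(boosted)
```

Because xBLEU is a weighted average of the candidates' sBLEU scores, it always lies between the worst and best candidate, and improves exactly when probability mass moves toward better translations.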
153-154 Results. [figure: BLEU for the phrase-based baseline (Koehn et al., 2003), the RNN trained with cross-entropy, and the RNN trained with +xBLEU]
155-156 Scaling linear reordering models (Auli et al., EMNLP 2014): xBLEU training of millions of linear features. [figure: BLEU on news2011 improves by up to +0.7 over the baseline (roughly 24.5-26.0 BLEU) as the feature set grows from 4.5K to 3M features and the training set from 2K to 272K sentences]
157 My other neural network projects: social media response generation with RNNs (building neural net-based conversational agents from Twitter conversations); semi-supervised phrase table expansion with word embeddings (using distributional word and phrase representations and mapping between distributional source and target spaces with RBVs); CCG parsing & tagging with RNNs.
158 Semantic CCG parsing (Combinatory Categorial Grammar; Steedman 2000). Zettlemoyer (2005, 2007), Bos (2008), Kwiatkowski (2010, 2013), Krishnamurthy (2012), Lewis (2013a,b). Derivation for "Marcel proved completeness": Marcel ⊢ NP : marcel; proved ⊢ (S\NP)/NP : λy.λx.proved(x,y); completeness ⊢ NP : completeness; proved completeness ⊢ S\NP : λx.proved(x, completeness); full sentence ⊢ S : proved(marcel, completeness).
159-160 How is this useful? A user query such as "Marcel proved completeness" is semantically parsed into a logical form, executed against a knowledge base, and the answer is returned. Parse failures and lexical ambiguity are a major source of errors in semantic parsing (Kwiatkowski 2013).
161 Integrated parsing & tagging with belief propagation, dual decomposition and softmax-margin training (Auli & Lopez, ACL 2011a,b; Auli & Lopez, EMNLP 2011). The parsing factor computes marginals with inside-outside and the supertagging factor with forward-backward, combined per span as p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j} o_{i,j}. Example: "Marcel proved completeness".
162-167 Integrated parsing & tagging results (Auli & Lopez, EMNLP 2011): F-measure loss for the parsing sub-model (+DecF1), Hamming loss for the supertagging sub-model (+Tagger), belief propagation for inference; best CCG parsing results to date. [figure: labelled F-measure for C&C '07, Petrov-I5, Xu '14, belief propagation, and +F1 loss; CCGbank conversion per Fowler & Penn (2010)]
168-169 Recurrent nets for CCG supertagging & parsing (with Wenduan Xu): assign lexical categories such as Marcel ⊢ NP and proved ⊢ (S\NP)/NP with an RNN. [figure: tagging and parsing accuracy of the CRF (Clark & Curran '07), the feed-forward net (Lewis & Steedman '14), and the RNN]
170 Summary: two RNN translation models; neural nets help most when discrete models are sparse; a task-specific objective gives the best performance. Next: better modeling of the source side, e.g. bi-directional RNNs, different architectures.
More informationDeep Learning For Mathematical Functions
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationFoundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model
Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipop Koehn) 30 January
More informationEmpirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model
Empirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model (most slides from Sharon Goldwater; some adapted from Philipp Koehn) 5 October 2016 Nathan Schneider
More informationDeep Learning Recurrent Networks 2/28/2018
Deep Learning Recurrent Networks /8/8 Recap: Recurrent networks can be incredibly effective Story so far Y(t+) Stock vector X(t) X(t+) X(t+) X(t+) X(t+) X(t+5) X(t+) X(t+7) Iterated structures are good
More informationCS230: Lecture 10 Sequence models II
CS23: Lecture 1 Sequence models II Today s outline We will learn how to: - Automatically score an NLP model I. BLEU score - Improve Machine II. Beam Search Translation results with Beam search III. Speech
More informationA Discriminative Model for Semantics-to-String Translation
A Discriminative Model for Semantics-to-String Translation Aleš Tamchyna 1 and Chris Quirk 2 and Michel Galley 2 1 Charles University in Prague 2 Microsoft Research July 30, 2015 Tamchyna, Quirk, Galley
More informationNEURAL LANGUAGE MODELS
COMP90042 LECTURE 14 NEURAL LANGUAGE MODELS LANGUAGE MODELS Assign a probability to a sequence of words Framed as sliding a window over the sentence, predicting each word from finite context to left E.g.,
More informationMarrying Dynamic Programming with Recurrent Neural Networks
Marrying Dynamic Programming with Recurrent Neural Networks I eat sushi with tuna from Japan Liang Huang Oregon State University Structured Prediction Workshop, EMNLP 2017, Copenhagen, Denmark Marrying
More informationRecurrent Neural Networks (Part - 2) Sumit Chopra Facebook
Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook Recap Standard RNNs Training: Backpropagation Through Time (BPTT) Application to sequence modeling Language modeling Applications: Automatic speech
More informationDeep learning for Natural Language Processing and Machine Translation
Deep learning for Natural Language Processing and Machine Translation 2015.10.16 Seung-Hoon Na Contents Introduction: Neural network, deep learning Deep learning for Natural language processing Neural
More informationImproving Sequence-to-Sequence Constituency Parsing
Improving Sequence-to-Sequence Constituency Parsing Lemao Liu, Muhua Zhu and Shuming Shi Tencent AI Lab, Shenzhen, China {redmondliu,muhuazhu, shumingshi}@tencent.com Abstract Sequence-to-sequence constituency
More informationDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing Dylan Drover, Borui Ye, Jie Peng University of Waterloo djdrover@uwaterloo.ca borui.ye@uwaterloo.ca July 8, 2015 Dylan Drover, Borui Ye, Jie Peng (University
More informationAn overview of word2vec
An overview of word2vec Benjamin Wilson Berlin ML Meetup, July 8 2014 Benjamin Wilson word2vec Berlin ML Meetup 1 / 25 Outline 1 Introduction 2 Background & Significance 3 Architecture 4 CBOW word representations
More informationDeep Learning for NLP
Deep Learning for NLP Instructor: Wei Xu Ohio State University CSE 5525 Many slides from Greg Durrett Outline Motivation for neural networks Feedforward neural networks Applying feedforward neural networks
More informationCross-Lingual Language Modeling for Automatic Speech Recogntion
GBO Presentation Cross-Lingual Language Modeling for Automatic Speech Recogntion November 14, 2003 Woosung Kim woosung@cs.jhu.edu Center for Language and Speech Processing Dept. of Computer Science The
More informationRecurrent Neural Networks. Jian Tang
Recurrent Neural Networks Jian Tang tangjianpku@gmail.com 1 RNN: Recurrent neural networks Neural networks for sequence modeling Summarize a sequence with fix-sized vector through recursively updating
More informationDeep Learning. Language Models and Word Embeddings. Christof Monz
Deep Learning Today s Class N-gram language modeling Feed-forward neural language model Architecture Final layer computations Word embeddings Continuous bag-of-words model Skip-gram Negative sampling 1
More informationChapter 3: Basics of Language Modelling
Chapter 3: Basics of Language Modelling Motivation Language Models are used in Speech Recognition Machine Translation Natural Language Generation Query completion For research and development: need a simple
More informationperplexity = 2 cross-entropy (5.1) cross-entropy = 1 N log 2 likelihood (5.2) likelihood = P(w 1 w N ) (5.3)
Chapter 5 Language Modeling 5.1 Introduction A language model is simply a model of what strings (of words) are more or less likely to be generated by a speaker of English (or some other language). More
More informationCS230: Lecture 8 Word2Vec applications + Recurrent Neural Networks with Attention
CS23: Lecture 8 Word2Vec applications + Recurrent Neural Networks with Attention Today s outline We will learn how to: I. Word Vector Representation i. Training - Generalize results with word vectors -
More informationAlgorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley
Algorithms for NLP Language Modeling III Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Office hours on website but no OH for Taylor until next week. Efficient Hashing Closed address
More informationDISTRIBUTIONAL SEMANTICS
COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS LEXICAL DATABASES - PROBLEMS Manually constructed Expensive Human annotation can be biased and noisy Language is dynamic New words: slangs, terminology, etc.
More informationANLP Lecture 6 N-gram models and smoothing
ANLP Lecture 6 N-gram models and smoothing Sharon Goldwater (some slides from Philipp Koehn) 27 September 2018 Sharon Goldwater ANLP Lecture 6 27 September 2018 Recap: N-gram models We can model sentence
More informationwith Local Dependencies
CS11-747 Neural Networks for NLP Structured Prediction with Local Dependencies Xuezhe Ma (Max) Site https://phontron.com/class/nn4nlp2017/ An Example Structured Prediction Problem: Sequence Labeling Sequence
More informationLecture 6: Neural Networks for Representing Word Meaning
Lecture 6: Neural Networks for Representing Word Meaning Mirella Lapata School of Informatics University of Edinburgh mlap@inf.ed.ac.uk February 7, 2017 1 / 28 Logistic Regression Input is a feature vector,
More informationstatistical machine translation
statistical machine translation P A R T 3 : D E C O D I N G & E V A L U A T I O N CSC401/2511 Natural Language Computing Spring 2019 Lecture 6 Frank Rudzicz and Chloé Pou-Prom 1 University of Toronto Statistical
More informationThe Noisy Channel Model and Markov Models
1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle
More informationLab 12: Structured Prediction
December 4, 2014 Lecture plan structured perceptron application: confused messages application: dependency parsing structured SVM Class review: from modelization to classification What does learning mean?
More informationExploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
Journal of Machine Learning Research 12 (2011) 1-30 Submitted 5/10; Revised 11/10; Published 1/11 Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation Yizhao
More informationN-grams. Motivation. Simple n-grams. Smoothing. Backoff. N-grams L545. Dept. of Linguistics, Indiana University Spring / 24
L545 Dept. of Linguistics, Indiana University Spring 2013 1 / 24 Morphosyntax We just finished talking about morphology (cf. words) And pretty soon we re going to discuss syntax (cf. sentences) In between,
More informationDiscriminative Training. March 4, 2014
Discriminative Training March 4, 2014 Noisy Channels Again p(e) source English Noisy Channels Again p(e) p(g e) source English German Noisy Channels Again p(e) p(g e) source English German decoder e =
More informationRecurrent Neural Networks with Flexible Gates using Kernel Activation Functions
2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 18) Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions Authors: S. Scardapane, S. Van Vaerenbergh,
More informationWhat Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1
What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1 Multi-layer networks Steve Renals Machine Learning Practical MLP Lecture 3 7 October 2015 MLP Lecture 3 Multi-layer networks 2 What Do Single
More informationNaïve Bayes, Maxent and Neural Models
Naïve Bayes, Maxent and Neural Models CMSC 473/673 UMBC Some slides adapted from 3SLP Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words
More informationLanguage Model. Introduction to N-grams
Language Model Introduction to N-grams Probabilistic Language Model Goal: assign a probability to a sentence Application: Machine Translation P(high winds tonight) > P(large winds tonight) Spelling Correction
More informationTask-Oriented Dialogue System (Young, 2000)
2 Review Task-Oriented Dialogue System (Young, 2000) 3 http://rsta.royalsocietypublishing.org/content/358/1769/1389.short Speech Signal Speech Recognition Hypothesis are there any action movies to see
More informationBayesian Paragraph Vectors
Bayesian Paragraph Vectors Geng Ji 1, Robert Bamler 2, Erik B. Sudderth 1, and Stephan Mandt 2 1 Department of Computer Science, UC Irvine, {gji1, sudderth}@uci.edu 2 Disney Research, firstname.lastname@disneyresearch.com
More informationStructured Prediction Theory and Algorithms
Structured Prediction Theory and Algorithms Joint work with Corinna Cortes (Google Research) Vitaly Kuznetsov (Google Research) Scott Yang (Courant Institute) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE
More informationDecoding and Inference with Syntactic Translation Models
Decoding and Inference with Syntactic Translation Models March 5, 2013 CFGs S NP VP VP NP V V NP NP CFGs S NP VP S VP NP V V NP NP CFGs S NP VP S VP NP V NP VP V NP NP CFGs S NP VP S VP NP V NP VP V NP
More informationConditional Language modeling with attention
Conditional Language modeling with attention 2017.08.25 Oxford Deep NLP 조수현 Review Conditional language model: assign probabilities to sequence of words given some conditioning context x What is the probability
More informationDeep Learning Sequence to Sequence models: Attention Models. 17 March 2018
Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:
More informationAnalysis of techniques for coarse-to-fine decoding in neural machine translation
Analysis of techniques for coarse-to-fine decoding in neural machine translation Soňa Galovičová E H U N I V E R S I T Y T O H F R G E D I N B U Master of Science by Research School of Informatics University
More informationACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging
ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers
More informationarxiv: v1 [cs.cl] 21 May 2017
Spelling Correction as a Foreign Language Yingbo Zhou yingbzhou@ebay.com Utkarsh Porwal uporwal@ebay.com Roberto Konow rkonow@ebay.com arxiv:1705.07371v1 [cs.cl] 21 May 2017 Abstract In this paper, we
More informationA Support Vector Method for Multivariate Performance Measures
A Support Vector Method for Multivariate Performance Measures Thorsten Joachims Cornell University Department of Computer Science Thanks to Rich Caruana, Alexandru Niculescu-Mizil, Pierre Dupont, Jérôme
More informationStatistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields
Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project
More informationLanguage Model Rest Costs and Space-Efficient Storage
Language Model Rest Costs and Space-Efficient Storage Kenneth Heafield Philipp Koehn Alon Lavie Carnegie Mellon, University of Edinburgh July 14, 2012 Complaint About Language Models Make Search Expensive
More informationTTIC 31230, Fundamentals of Deep Learning David McAllester, April Sequence to Sequence Models and Attention
TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 Sequence to Sequence Models and Attention Encode-Decode Architectures for Machine Translation [Figure from Luong et al.] In Sutskever
More informationMore on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013
More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013 Summary Parts of speech tagging HMMs: Unsupervised parameter estimation Forward Backward algorithm Bayesian variants Discriminative
More informationMachine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6
Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)
More informationNatural Language Processing (CSEP 517): Machine Translation (Continued), Summarization, & Finale
Natural Language Processing (CSEP 517): Machine Translation (Continued), Summarization, & Finale Noah Smith c 2017 University of Washington nasmith@cs.washington.edu May 22, 2017 1 / 30 To-Do List Online
More informationConsensus Training for Consensus Decoding in Machine Translation
Consensus Training for Consensus Decoding in Machine Translation Adam Pauls, John DeNero and Dan Klein Computer Science Division University of California at Berkeley {adpauls,denero,klein}@cs.berkeley.edu
More informationDeep Learning for Natural Language Processing. Sidharth Mudgal April 4, 2017
Deep Learning for Natural Language Processing Sidharth Mudgal April 4, 2017 Table of contents 1. Intro 2. Word Vectors 3. Word2Vec 4. Char Level Word Embeddings 5. Application: Entity Matching 6. Conclusion
More informationFeature Noising. Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager
Feature Noising Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager Outline Part 0: Some backgrounds Part 1: Dropout as adaptive
More informationA Supertag-Context Model for Weakly-Supervised CCG Parser Learning
A Supertag-Context Model for Weakly-Supervised CCG Parser Learning Dan Garrette Chris Dyer Jason Baldridge Noah A. Smith U. Washington CMU UT-Austin CMU Contributions 1. A new generative model for learning
More informationIntroduction to Deep Neural Networks
Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic
More informationword2vec Parameter Learning Explained
word2vec Parameter Learning Explained Xin Rong ronxin@umich.edu Abstract The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent two years. The vector
More informationLow-Dimensional Discriminative Reranking. Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park
Low-Dimensional Discriminative Reranking Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park Discriminative Reranking Useful for many NLP tasks Enables us to use arbitrary features
More informationA Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister
A Syntax-based Statistical Machine Translation Model Alexander Friedl, Georg Teichtmeister 4.12.2006 Introduction The model Experiment Conclusion Statistical Translation Model (STM): - mathematical model
More informationFast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation Joern Wuebker, Hermann Ney Human Language Technology and Pattern Recognition Group Computer Science
More informationMACHINE LEARNING FOR NATURAL LANGUAGE PROCESSING
MACHINE LEARNING FOR NATURAL LANGUAGE PROCESSING Outline Some Sample NLP Task [Noah Smith] Structured Prediction For NLP Structured Prediction Methods Conditional Random Fields Structured Perceptron Discussion
More informationTopics in Natural Language Processing
Topics in Natural Language Processing Shay Cohen Institute for Language, Cognition and Computation University of Edinburgh Lecture 9 Administrativia Next class will be a summary Please email me questions
More informationCSE446: Neural Networks Spring Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer
CSE446: Neural Networks Spring 2017 Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer Human Neurons Switching time ~ 0.001 second Number of neurons 10 10 Connections per neuron 10 4-5 Scene
More informationNatural Language Processing (CSEP 517): Machine Translation
Natural Language Processing (CSEP 57): Machine Translation Noah Smith c 207 University of Washington nasmith@cs.washington.edu May 5, 207 / 59 To-Do List Online quiz: due Sunday (Jurafsky and Martin, 2008,
More informationCSC321 Lecture 16: ResNets and Attention
CSC321 Lecture 16: ResNets and Attention Roger Grosse Roger Grosse CSC321 Lecture 16: ResNets and Attention 1 / 24 Overview Two topics for today: Topic 1: Deep Residual Networks (ResNets) This is the state-of-the
More informationMultiword Expression Identification with Tree Substitution Grammars
Multiword Expression Identification with Tree Substitution Grammars Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning Stanford University EMNLP 2011 Main Idea Use syntactic
More informationDeep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści
Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?
More informationSlide credit from Hung-Yi Lee & Richard Socher
Slide credit from Hung-Yi Lee & Richard Socher 1 Review Recurrent Neural Network 2 Recurrent Neural Network Idea: condition the neural network on all previous words and tie the weights at each time step
More informationProbabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov
Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly
More information