End-to-end Automatic Speech Recognition


1 End-to-end Automatic Speech Recognition. Markus Nussbaum-Thom, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA. February 22, 2017.

2 Contents. 1. Introduction 2. Connectionist Temporal Classification (CTC) 3. Attention Model 4. References

3 Terminology. Features: x, x_t, x_1^T := x_1, ..., x_T. Words: w, u, v, w_m, w_1^M := w_1, ..., w_M. Word sequences: W, W_n, V. States: s, s_t, s_1^T := s_1, ..., s_T. Class conditional posterior probability: p(s_t | x_t), p(w, s_1^T | x_1^T).

4 Bayes Decision Rule

5 Towards End-to-End Automatic Speech Recognition

6 Towards End-to-End Automatic Speech Recognition

7 End-to-End Automatic Speech Recognition

8 End-to-End Approach. End-to-end: training all modules to optimize a global performance criterion (LeCun et al., 98). Easy: classes do not have a sub-structure, e.g. image classification. Difficult: classes have a sub-structure (sequences, graphs), e.g. automatic speech recognition, automatic handwriting recognition, machine translation. Segmentation problem: which part of the input relates to which part of the sub-structure?

9 Towards End-to-end Automatic Speech Recognition. End-to-end acoustic model: using characters instead of phonemes; connectionist temporal classification using recurrent or convolutional neural networks; purely neural attention model. End-to-end feature extraction: feature extraction integrated into the acoustic model; using the raw time signal; learning a specific type of filter. Towards real end-to-end modeling: using words as targets instead of characters or phonemes, and a massive amount of data.

10 Basic Problem. Input: X = x_1^T = (x_1, ..., x_T). Neural network outputs: p(. | x_1), ..., p(. | x_T). Target: W = w_1^M = (w_1, ..., w_M), but M << T. How do we solve this? Connectionist Temporal Classification (CTC) [Graves et al., 2006, Graves et al., 2009, CTC]. Attention models [Bahdanau et al., 2016, Chorowski et al., 2015, Chorowski et al., 2015, Attention]. Inverted hidden Markov models [Doetsch et al., 2016, Inverted HMM - a Proof of Concept].

11 Overview. CTC concept. Training. Recognition.

12 Connectionist Temporal Classification (CTC). Given X = (x_1, ..., x_5) and W = (a, b, c). Introduce a blank state "-" and allow word repetitions, so that every frame x_1, ..., x_5 is assigned either a label of W or the blank, e.g. the length-5 alignments (a, -, b, c, -), (a, a, b, c, -) and (a, b, b, c, -). Blank and repetition removal B: B(a, -, b, c, -) = (a, b, c).
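To make the mapping B concrete, here is a minimal Python sketch (not from the slides) that first merges repetitions and then drops blanks; the symbol "-" for the blank is an assumption of the example.

```python
# A minimal sketch of the collapse mapping B:
# merge repeated labels, then drop blanks.
def collapse(alignment, blank="-"):
    """Apply B: remove label repetitions, then remove blanks."""
    out = []
    prev = None
    for s in alignment:
        if s != prev:          # drop immediate repetitions
            if s != blank:     # drop blanks
                out.append(s)
        prev = s
    return tuple(out)

assert collapse(("a", "-", "b", "c", "-")) == ("a", "b", "c")
assert collapse(("a", "a", "b", "b", "c")) == ("a", "b", "c")
```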

13 Connectionist Temporal Classification (CTC). Posterior for sentence W = w_1^M and features X = x_1^T:
p(W | X) = \sum_{s_1^T \in B^{-1}(W)} p(s_1^T | X) := \sum_{s_1^T \in B^{-1}(W)} \prod_{t=1}^{T} p(s_t | x_t)
Training criterion for training samples (X_n, W_n), n = 1, ..., N:
F_{CTC}(\Lambda) = \frac{1}{N} \sum_{n=1}^{N} \log p_\Lambda(W_n | X_n)
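The sum over B^{-1}(W) can be spelled out by brute force for a toy example. The sketch below (illustration only, reusing collapse() from the previous sketch, with made-up frame posteriors) enumerates all length-T alignments and sums the products of frame posteriors; a real implementation uses the forward-backward recursions of the following slides instead.

```python
# Brute-force CTC posterior: enumerate all length-T alignments,
# keep those that collapse to W, and sum the path probabilities.
import itertools

labels = ["a", "b", "c", "-"]            # "-" is the blank
T = 5
W = ("a", "b", "c")

# hypothetical per-frame posteriors p(s | x_t), each row sums to 1
post = [
    {"a": 0.7, "b": 0.1, "c": 0.1, "-": 0.1},
    {"a": 0.2, "b": 0.2, "c": 0.1, "-": 0.5},
    {"a": 0.1, "b": 0.6, "c": 0.1, "-": 0.2},
    {"a": 0.1, "b": 0.1, "c": 0.6, "-": 0.2},
    {"a": 0.1, "b": 0.1, "c": 0.2, "-": 0.6},
]

p_W_given_X = 0.0
for path in itertools.product(labels, repeat=T):
    if collapse(path) == W:              # collapse() as sketched above
        prob = 1.0
        for t, s in enumerate(path):
            prob *= post[t][s]
        p_W_given_X += prob
print(p_W_given_X)
```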

14 Overview. CTC concept. Training. Recognition.

15 Forward-Backward Decomposition (CTC).
\alpha(t, m, v): sum over s_1^t \in B^{-1}(w_1^m) for given x_1^t, ending in v.
\beta(t, m, v): sum over s_t^T \in B^{-1}(w_m^M) for given x_t^T, starting in v.
Choose t \in \{1, ..., T\}:
p(w_1^M | x_1^T) = \sum_{s_1^T \in B^{-1}(w_1^M)} p(s_1^T | x_1^T) = ... = \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \frac{\alpha(t, m, v)}{p(v | x_t)} \; p(v | x_t) \; \frac{\beta(t, m, v)}{p(v | x_t)}

16 Forward Algorithm (CTC)

17 Forward Algorithm (CTC). Compute \alpha(1, m, v) = p(v | x_1).

18 Forward Algorithm (CTC). Compute \alpha(2, m, v).

19 Forward Algorithm (CTC). Compute \alpha(3, m, v).

20 Forward Algorithm (CTC). Compute \alpha(4, m, v).

21 Forward Algorithm (CTC). Compute \alpha(5, m, v).

22 Forward Algorithm (CTC). Compute \alpha(6, m, v).

23 Forward Algorithm (CTC). Compute \alpha(7, m, v).

24 Forward Algorithm (CTC). Compute \alpha(8, m, v).

25 Forward Algorithm (CTC). Compute \alpha(9, m, v).

26 Forward Algorithm (CTC). Compute \alpha(10, m, v).

27 Forward Algorithm (CTC). Compute \alpha(11, m, v).

28 Forward Algorithm (CTC). Compute \alpha(12, m, v).

29 Backward Algorithm (CTC)

30 Backward Algorithm (CTC). Compute \beta(12, M, v) = p(v | x_12).

31 Backward Algorithm (CTC). Compute \beta(11, m, v).

32 Backward Algorithm (CTC). Compute \beta(10, m, v).

33 Backward Algorithm (CTC). Compute \beta(9, m, v).

34 Backward Algorithm (CTC). Compute \beta(8, m, v).

35 Backward Algorithm (CTC). Compute \beta(7, m, v).

36 Backward Algorithm (CTC). Compute \beta(6, m, v).

37 Backward Algorithm (CTC). Compute \beta(5, m, v).

38 Backward Algorithm (CTC). Compute \beta(4, m, v).

39 Backward Algorithm (CTC). Compute \beta(3, m, v).

40 Backward Algorithm (CTC). Compute \beta(2, m, v).

41 Backward Algorithm (CTC). Compute \beta(1, m, v).

42 Posterior Decomposition (CTC). Choose t \in \{1, ..., T\}:
p(w_1^M | X) = \sum_{s_1^T \in B^{-1}(w_1^M)} p(s_1^T | X)   (definition)
= \sum_{s_1^T \in B^{-1}(w_1^M)} \prod_{\tau=1}^{T} p(s_\tau | x_\tau)   (model assumption)
= \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \sum_{s_1^T \in B^{-1}(w_1^M), s_t = v} \prod_{\tau=1}^{t-1} p(s_\tau | x_\tau) \; p(s_t | x_t) \prod_{\rho=t+1}^{T} p(s_\rho | x_\rho)   (decomposition around t)

43 Posterior Decomposition (CTC).
= \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \sum_{s_1^T \in B^{-1}(w_1^M), s_t = v} \frac{\prod_{\tau=1}^{t} p(s_\tau | x_\tau)}{p(v | x_t)} \; p(v | x_t) \; \frac{\prod_{\rho=t}^{T} p(s_\rho | x_\rho)}{p(v | x_t)}

44 Posterior Decomposition (CTC).
= \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \frac{\sum_{s_1^t \in B^{-1}(w_1^m), s_t = v} \prod_{\tau=1}^{t} p(s_\tau | x_\tau)}{p(v | x_t)} \; p(v | x_t) \; \frac{\sum_{s_t^T \in B^{-1}(w_m^M), s_t = v} \prod_{\rho=t}^{T} p(s_\rho | x_\rho)}{p(v | x_t)}
= \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \frac{\alpha(t, m, v)}{p(v | x_t)} \; p(v | x_t) \; \frac{\beta(t, m, v)}{p(v | x_t)}
\alpha(t, m, v): sum over s_1^t \in B^{-1}(w_1^m) for given x_1^t, ending in v.
\beta(t, m, v): sum over s_t^T \in B^{-1}(w_m^M) for given x_t^T, starting in v.

45 Forward Path Decomposition (CTC). Consider a path s_1^t \in B^{-1}(w_1^m) with s_t = v. If s_t = w_m, the prefix s_1^{t-1} can end in w_m (a repetition), in the blank, or in w_{m-1}, i.e. s_1^{t-1} collapses to w_1^m or to w_1^{m-1}. If s_t = -, the prefix s_1^{t-1} can end in the blank or in w_m, i.e. s_1^{t-1} collapses to w_1^m.

46 Forward Probabilities (CTC).
\alpha(t, m, v) = \sum_{s_1^t \in B^{-1}(w_1^m), s_t = v} \prod_{\tau=1}^{t} p(s_\tau | x_\tau)
= p(v | x_t) \cdot \begin{cases} \sum_{u \in \{w_{m-1}, -\}} \sum_{s_1^{t-1} \in B^{-1}(w_1^{m-1}), s_{t-1} = u} \prod_{\tau=1}^{t-1} p(s_\tau | x_\tau) + \sum_{s_1^{t-1} \in B^{-1}(w_1^m), s_{t-1} = w_m} \prod_{\tau=1}^{t-1} p(s_\tau | x_\tau), & v = w_m \\ \sum_{u \in \{w_m, -\}} \sum_{s_1^{t-1} \in B^{-1}(w_1^m), s_{t-1} = u} \prod_{\tau=1}^{t-1} p(s_\tau | x_\tau), & v = - \end{cases}

47 Forward Probabilities (CTC).
= p(v | x_t) \cdot \begin{cases} \sum_{u \in \{w_{m-1}, -\}} \alpha(t-1, m-1, u) + \alpha(t-1, m, w_m), & v = w_m \\ \sum_{u \in \{w_m, -\}} \alpha(t-1, m, u), & v = - \end{cases}
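As an illustration, the sketch below implements the same dynamic program in the standard blank-extended formulation of Graves et al. (blanks interleaved with the labels), which is equivalent to the \alpha(t, m, v) bookkeeping above; it reuses post and W from the brute-force sketch and should reproduce its result up to rounding.

```python
# Minimal sketch of the CTC forward (alpha) recursion in the standard
# blank-extended formulation (equivalent to alpha(t, m, v) above).
# `post` is assumed to be a list of dicts p(s | x_t), as in the earlier sketch.
def ctc_forward(post, W, blank="-"):
    ext = [blank]
    for w in W:                      # interleave blanks: -, w1, -, w2, ..., wM, -
        ext += [w, blank]
    T, S = len(post), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = post[0][blank]     # start in the leading blank
    alpha[0][1] = post[0][ext[1]]    # or directly in the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                      # stay (repetition)
            if s >= 1:
                a += alpha[t - 1][s - 1]             # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]             # skip the blank between different labels
            alpha[t][s] = post[t][ext[s]] * a
    return alpha[T - 1][S - 1] + alpha[T - 1][S - 2]  # end in last blank or last label

# agrees with the brute-force enumeration above (up to float rounding)
print(ctc_forward(post, W))
```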

48 Backward Probabilities (CTC). \beta(t, m, v): sum over all paths s_t^T \in B^{-1}(w_m^M) for given x_t^T starting in v:
\beta(t, m, v) = \sum_{s_t^T \in B^{-1}(w_m^M), s_t = v} \prod_{\tau=t}^{T} p(s_\tau | x_\tau)
= p(v | x_t) \cdot \begin{cases} \sum_{u \in \{w_{m+1}, -\}} \beta(t+1, m+1, u) + \beta(t+1, m, w_m), & v = w_m \\ \sum_{u \in \{w_m, -\}} \beta(t+1, m, u), & v = - \end{cases}
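A companion sketch of the \beta recursion in the same blank-extended formulation; combining it with ctc_forward above via the posterior decomposition of slide 44 gives p(W | X) for any choice of t.

```python
# Backward (beta) recursion, mirroring ctc_forward above.
def ctc_backward(post, W, blank="-"):
    ext = [blank]
    for w in W:
        ext += [w, blank]
    T, S = len(post), len(ext)
    beta = [[0.0] * S for _ in range(T)]
    beta[T - 1][S - 1] = post[T - 1][blank]       # end in the trailing blank
    beta[T - 1][S - 2] = post[T - 1][ext[S - 2]]  # or in the last label
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1][s]                               # stay
            if s + 1 < S:
                b += beta[t + 1][s + 1]                      # advance
            if s + 2 < S and ext[s] != blank and ext[s] != ext[s + 2]:
                b += beta[t + 1][s + 2]                      # skip a blank
            beta[t][s] = post[t][ext[s]] * b
    return beta

# sanity check: for any t, sum_s alpha[t][s] * beta[t][s] / p(ext[s] | x_t)
# equals p(W | X) from ctc_forward.
```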

49 Derivatives (CTC). Derivative of the posterior:
\frac{\partial p(W | X)}{\partial p(s | x_t)} = \frac{\partial}{\partial p(s | x_t)} \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \frac{\alpha(t, m, v)}{p(v | x_t)} \; p(v | x_t) \; \frac{\beta(t, m, v)}{p(v | x_t)} = \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \delta(v, s) \; \frac{\alpha(t, m, s) \, \beta(t, m, s)}{p^2(s | x_t)}
\frac{\partial \log p(W | X)}{\partial p(s | x_t)} = \frac{1}{p(W | X)} \; \frac{\partial p(W | X)}{\partial p(s | x_t)}
Derivative of the training criterion:
\frac{\partial F_{CTC}(\Lambda)}{\partial p(s | x_t)} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{p(W_n | X_n)} \; \frac{\partial p(W_n | X_n)}{\partial p(s | x_t)}

50 CTC Architectures. What kind of encoders? DNNs, (bidirectional) LSTMs, CNNs. Subsampling: reducing the frame rate through the network.

51 Subsampling. Join input frames. Reshape the input to the next layer: return every 2nd frame to the next layer. CNNs: use strides.
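Two common ways to realize such subsampling on a [T, D] feature matrix are sketched below with NumPy (illustrative names and factors, not the lecture's code): either skip frames or stack neighbouring frames into wider vectors.

```python
import numpy as np

def subsample_skip(x, factor=2):
    """Keep every `factor`-th frame: [T, D] -> [ceil(T/factor), D]."""
    return x[::factor]

def subsample_stack(x, factor=2):
    """Stack neighbouring frames instead of dropping them:
    [T, D] -> [T // factor, factor * D]."""
    T, D = x.shape
    T = (T // factor) * factor          # drop a trailing remainder frame if any
    return x[:T].reshape(T // factor, factor * D)

feats = np.random.randn(100, 40)        # e.g. 100 frames of 40-dim features
print(subsample_skip(feats).shape)      # (50, 40)
print(subsample_stack(feats).shape)     # (50, 80)
```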

52 Peaking Behavior. [Graves and Jaitly, 2014, Towards End-to-End Speech Recognition with Recurrent Neural Networks]

53 Overview. CTC concept. Training. Recognition.

54 Hybrid Recognition (CTC). Hybrid model:
p(x | s) = \frac{p(s | x) \, p(x)}{p(s)} \propto \frac{p(s | x)}{p(s)}
Decoding:
\hat{w}_1^N = \arg\max_{w_1^N} \left\{ p(w_1^N) \max_{s_1^T} \prod_{t=1}^{T} p(x_t | s_t) \right\}
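A minimal sketch of the hybrid trick (illustrative, with made-up sizes): the network posteriors p(s | x_t) are divided by label priors p(s), so the resulting scaled likelihoods can replace p(x_t | s_t) up to a constant during decoding.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors, prior_scale=1.0):
    """log_posteriors: [T, S] network outputs; log_priors: [S] label priors
    (e.g. relative label frequencies in the training alignments)."""
    return log_posteriors - prior_scale * log_priors

T, S = 200, 30                                   # hypothetical sizes
log_post = np.log(np.random.dirichlet(np.ones(S), size=T))
log_prior = np.log(np.full(S, 1.0 / S))          # flat prior just for the demo
scores = scaled_log_likelihoods(log_post, log_prior)
print(scores.shape)                              # (200, 30)
```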

55 Weighted Finite State Transducer Recognition (WFST). Token transducer T. Language model G. Lexicon L. Search space: S = T \circ \min(\det(L \circ G))

56 Resources for CTC.
Keras:
  Tensorflow backend: master/keras/backend/tensorflow_backend.py
  Theano backend: master/keras/backend/theano_backend.py
  Example: master/examples/image_ocr.py
Baidu: https://github.com/baidu-research/ba-dls-deepspeech
Eesen.
Kaldi.

57 Further Literature on CTC. [Miao et al., 2015, EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding] [Collobert et al., 2016, Wav2Letter: an End-to-End ConvNet-based Speech Recognition System] [Zhang et al., 2017, Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks] [Senior et al., 2015, Acoustic modelling with CD-CTC-SMBR LSTM RNNS] [Soltau et al., 2016, Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition]

58 Attention Model. Encoder-decoder architecture: Encoder: performs a feature extraction/encoding based on the input. Decoder: produces the output label sequence from the encoded features.

59 Attention Encoder-Decoder Architecture. What kind of encoders? DNNs, (bidirectional) LSTMs, CNNs. (Figure: output - decoder - MLP glimpse - encoder.)

60 Attention Encoder-Decoder Architecture. Encoder-decoder:
Input: x_1^T = x_1, ..., x_T
Encoder: h_1^{T/4} = h_1, ..., h_{T/4} = Encoder(x_1^T)
Decoder, for m = 1, ..., M:
Attention: \alpha_m = Attend(h_1^{T/4}, s_{m-1}, \alpha_{m-1})
Glimpse: g_m = \sum_{t=1}^{T/4} \alpha_{m,t} h_t
Generator: y_m = Generator(g_m, s_{m-1}), with c_m = RNN(c_{m-1}, g_m, s_{m-1}) and y_m = Softmax(c_m)
Transition: s_m = RNN(s_{m-1}, y_m, g_m)
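The decoder loop can be sketched in NumPy as below; the Attend function, the RNN cells and the output projection are stand-ins with random weights, purely to show the data flow of attention weights, glimpse, generator and transition.

```python
import numpy as np

rng = np.random.default_rng(0)
Tq, D, V = 25, 16, 30                        # encoder length T/4, state size, vocab size
H = rng.standard_normal((Tq, D))             # encoder states h_1, ..., h_{T/4}
Wc = rng.standard_normal((D, 2 * D)) * 0.1   # generator-RNN weights (stand-in)
Ws = rng.standard_normal((D, 2 * D)) * 0.1   # transition-RNN weights (stand-in)
Wy = rng.standard_normal((V, D)) * 0.1       # output projection (stand-in)

def attend(H, s_prev):
    # placeholder scorer; slide 61 gives the content/location-based versions
    scores = H @ s_prev
    e = np.exp(scores - scores.max())
    return e / e.sum()

s = np.zeros(D)
for m in range(5):                           # emit 5 labels
    alpha = attend(H, s)                     # attention weights alpha_m
    g = alpha @ H                            # glimpse g_m = sum_t alpha_{m,t} h_t
    c = np.tanh(Wc @ np.concatenate([g, s])) # simplified generator state (the slide's RNN also takes c_{m-1})
    y = np.argmax(Wy @ c)                    # y_m from Softmax(c_m), taking the arg max
    s = np.tanh(Ws @ np.concatenate([g, s])) # simplified transition (the slide's RNN also takes y_m)
    print(m, int(y), int(alpha.argmax()))
```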

61 Attention Mechanism.
Content based (weights E, W, V and bias b):
\epsilon_{m,t} = E \tanh(W s_{m-1} + V h_t + b)
Location based (weights E, W, V, U and bias b):
f_m = F * \alpha_{m-1}
\epsilon_{m,t} = E \tanh(W s_{m-1} + V h_t + U f_{m,t} + b)
Renormalization (sharpening \gamma):
\alpha_{m,t} = \frac{\exp(\gamma \epsilon_{m,t})}{\sum_{\tau=1}^{T/4} \exp(\gamma \epsilon_{m,\tau})}
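A small NumPy sketch of the content-based score and the sharpened renormalization, with random matrices standing in for the learned E, W, V, b and an assumed sharpening factor gamma = 2:

```python
import numpy as np

rng = np.random.default_rng(1)
Tq, Dh, Ds, Da = 25, 16, 16, 8           # encoder length, h dim, s dim, attention dim
H = rng.standard_normal((Tq, Dh))        # encoder states h_t
s_prev = rng.standard_normal(Ds)         # previous decoder state s_{m-1}
E = rng.standard_normal(Da)
W = rng.standard_normal((Da, Ds))
V = rng.standard_normal((Da, Dh))
b = np.zeros(Da)
gamma = 2.0                              # sharpening factor

eps = np.array([E @ np.tanh(W @ s_prev + V @ h + b) for h in H])  # eps_{m,t}
alpha = np.exp(gamma * eps - np.max(gamma * eps))
alpha /= alpha.sum()                     # alpha_{m,t}, sums to 1 over t
print(alpha.round(3))
```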

62 Window Around Median. Compute the median of the previous attention weights:
\tau_m = \arg\min_{k=1,...,T/4} \left| \sum_{\rho=1}^{k} \alpha_{m-1,\rho} - \sum_{\theta=k+1}^{T/4} \alpha_{m-1,\theta} \right|
Compute attention only in a window around the median, T_m = \{\tau_m - \omega_{left}, ..., \tau_m + \omega_{right}\}:
\alpha_{m,t} = \frac{\exp(\gamma \epsilon_{m,t})}{\sum_{\tau \in T_m} \exp(\gamma \epsilon_{m,\tau})} if t \in T_m, and 0 otherwise.
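A sketch of the median-window restriction (window sizes w_left = w_right = 5 are illustrative, not from the slides): the median of the previous attention weights is located via the cumulative sum, and the sharpened softmax is renormalized inside the window only.

```python
import numpy as np

def median_window_weights(eps, alpha_prev, gamma=1.0, w_left=5, w_right=5):
    """eps: scores eps_{m,t}; alpha_prev: previous weights alpha_{m-1}."""
    Tq = len(alpha_prev)
    # median frame: left and right cumulative masses are as balanced as possible
    cum = np.cumsum(alpha_prev)
    tau = int(np.argmin(np.abs(cum - (cum[-1] - cum))))
    lo, hi = max(0, tau - w_left), min(Tq, tau + w_right + 1)
    alpha = np.zeros(Tq)
    e = np.exp(gamma * eps[lo:hi] - np.max(gamma * eps[lo:hi]))
    alpha[lo:hi] = e / e.sum()           # renormalize inside the window only
    return alpha

alpha_prev = np.full(25, 1.0 / 25)
eps = np.random.default_rng(2).standard_normal(25)
print(median_window_weights(eps, alpha_prev).round(3))
```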

63 Other Techniques (Attention).
Monotonic regularization:
r_m = \max\left(0, \sum_{\tau=1}^{T/4} \left( \sum_{i=1}^{\tau} \alpha_{m,i} - \sum_{i=1}^{\tau} \alpha_{m-1,i} \right) \right)
Curriculum learning: start with shorter sequences and gradually increase the sequence length.
Flat start: initial positions are chosen according to the speaker's speed.

64 Resources for Attention. Theano + Bricks + Blocks. Tensorflow. Keras.

65 Further Literature on Attention. [Bahdanau et al., 2016, End-to-end attention-based large vocabulary speech recognition] [Chorowski et al., 2015, Attention-Based Models for Speech Recognition] [Chorowski et al., 2015, End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results] [Kim et al., 2016, Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning]

66 Evaluation Framework. Development and evaluation set different from the training set. Levenshtein: minimum number of insertions, deletions and substitutions. (Figure: alignment of the spoken reference S P E A K E R with the recognized hypothesis T E A C H E R, each position marked as correct, substitution, insertion or deletion.)

67 Evaluation Framework. Levenshtein: minimum number of insertions, deletions and substitutions,
L(w_1^N, v_1^M) = \min_{s,t} \left\{ \sum_{i=1}^{\lambda} \left( 1 - \delta(w_{s(i)}, v_{t(i)}) \right) \right\}
with the Kronecker delta \delta(w, v) = 1 if v = w, and 0 if v \neq w.
Word Error Rate (WER):
WER(Spoken_1^R, Recognized_1^R) = \frac{\sum_{r=1}^{R} L(Spoken_r, Recognized_r)}{\sum_{r=1}^{R} |Spoken_r|}
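For completeness, a small Python sketch (standard dynamic programming, not the lecture's code) of the Levenshtein distance and the resulting WER over a list of reference/hypothesis sentence pairs:

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, insertions and deletions."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(d[j] + 1,                     # deletion
                      d[j - 1] + 1,                 # insertion
                      prev + (0 if r == h else 1))  # match / substitution
            prev, d[j] = d[j], cur
    return d[-1]

def wer(references, hypotheses):
    errors = sum(levenshtein(r.split(), h.split()) for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words

print(wer(["the cat sat"], ["the cat sat down"]))   # 1 insertion / 3 words
```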

68 Experimental Results.
Model                                   CER    WER
Bahdanau et al. (2015)
  Attention
  Attention + bigram LM
  Attention + trigram LM
  Attention + extended trigram LM
Graves and Jaitly (2014)
  CTC
Hannun et al. (2014)
  CTC + bigram LM                       n/a    14.1
Miao et al. (2015)
  CTC + bigram LM                       n/a    26.9
  CTC for phonemes + lexicon            n/a    26.9
  CTC for phonemes + trigram LM         n/a     7.3
  CTC + trigram LM                      n/a     9.0
  Hybrid BGRU (15 h)                    n/a     2.0

69 Attention Modeling Example from Handwriting

70 Inverted Hidden Markov Model (HMM). Traditional HMM:
p(w_1^N, x_1^T) = p(w_1^N) \, p(x_1^T | w_1^N) = \prod_{n=1}^{N} p(w_n | w_1^{n-1}) \sum_{s_1^T} p(s_1^T, x_1^T | w_1^N) = \prod_{n=1}^{N} p(w_n | w_1^{n-1}) \sum_{s_1^T} \prod_{t=1}^{T} p(s_t, x_t | s_1^{t-1}, x_1^{t-1}, w_1^N)
Inverted HMM:
p(w_1^N | x_1^T) = \sum_{t_1^N} p(w_1^N, t_1^N | x_1^T) = \sum_{t_1^N} \prod_{n=1}^{N} p(w_n, t_n | w_1^{n-1}, t_1^{n-1}, x_1^T)
[Doetsch et al., 2016, Inverted HMM - a Proof of Concept]

71 Unsolved Problems for End-to-End ASR. Error rates: still higher than traditional HMM-based systems (with one exception). Global search: still a transducer-based or HMM-based search. Acoustic model: word- and character-based end-to-end learning. Language model: no integration with the language model in training yet.

72 References.
Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., and Bengio, Y. (2016). End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada.
Collobert, R., Puhrsch, C., and Synnaeve, G. (2016). Wav2letter: an end-to-end convnet-based speech recognition system. CoRR.
Doetsch, P., Hegselmann, S., Schlüter, R., and Ney, H. (2016). Inverted HMM - a proof of concept. In Neural Information Processing Systems Workshop, Barcelona, Spain.
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, New York, NY, USA. ACM.
Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, June 2014.
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell., 31(5).
Kim, S., Hori, T., and Watanabe, S. (2016). Joint CTC-attention based end-to-end speech recognition using multi-task learning. CoRR.
Miao, Y., Gowayyed, M., and Metze, F. (2015). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015.
Senior, A. W., Sak, H., de Chaumont Quitry, F., Sainath, T. N., and Rao, K. (2015). Acoustic modelling with CD-CTC-SMBR LSTM RNNs. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015.
Soltau, H., Liao, H., and Sak, H. (2016). Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition. CoRR.
Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., and Courville, A. C. (2017). Towards end-to-end speech recognition with deep convolutional neural networks. CoRR.
