Translator

1 Translator

2 Marian Rejewski

3 A few words about Marian: portable C++ code with minimal dependencies (CUDA or MKL, and still Boost); single engine for training and decoding on GPU and CPU; custom auto-diff engine with dynamic graphs (similar to DyNet); optimized towards NMT and ...

4 Part I A Machine Translation Marathon 2016 project

5 The first commit. Commit: 6a7c93; Date: May 4th, 2016; Message: "very cool"; Lines: 155.

6 #include <iostream>
#include <vector>
#include "mad.h"

int main(int argc, char** argv) {
  Var x0 = 1, x1 = 2, x2 = 3;
  auto y = x0 + x0 + log(x2) + x1;

  std::vector<Var> x = { x0, x1, x2 };

  set_zero_all_adjoints();
  y.grad();

  std::cerr << "y = " << y.val() << std::endl;
  for(int i = 0; i < x.size(); ++i)
    std::cerr << " dy/dx_" << i << " = " << x[i].adj() << std::endl;
}

7 Var x0 = 1, x1 = 2, x2 = 3;
auto y = x0 + x0 + log(x2) + x1;

y = 5.09861
dy/dx_0 = 2
dy/dx_1 = 1
dy/dx_2 = 0.333333

8 [Chart] Lines of code over time.

9 [Chart] Lines of code over time: First commit.

10 [Chart] Lines of code over time: First commit; MTM.

11 A Neural Network Toolkit for MT. Maximiliana Behnke, Tomasz Dwojak, Marcin Junczys-Dowmunt, Roman Grundkiewicz, Andre Martins, Hieu Hoang, Lane Schwartz.

12 A Neural Network Toolkit for MT. Maximiliana Behnke, Tomasz Dwojak, Marcin Junczys-Dowmunt, Roman Grundkiewicz, Andre Martins, Hieu Hoang, Lane Schwartz.

13 Why create another NN toolkit?
Flexibility: add functionality easier & faster.
Speed: pure C++ implementation, GPU-enabled (CPU may come soon).
Learn about Deep Learning: implement everything from scratch by ourselves.

14 What we've achieved this week?
Framework to create computation graphs; simple feedforward NN; RNN, GRU, LSTM; binary or multiclass classifier.
Forward step: classify, given input data and weights.
Backward step: learn weights using backpropagation.
Tested with small datasets: MNIST (digit image recognition task), MT.
Documentation: Doxygen.

15 [Computation graph diagram: input, parameter and constant nodes feeding +, tanh, log and softmax operations]

16 What needs to be finished? Basic features: data shuffling; more random distributions; model serialization & deserialization; documentation.

17 [Chart] Lines of code over time: First commit; MTM.

18 [Chart] Lines of code over time: adds Working Nematus Model.

19 [Chart] Lines of code over time: adds AARA.

20 [Chart] Lines of code over time: adds Multi-gpu Training (MTM 2016).

21 [Chart] Lines of code over time: adds Marian decoder.

22 [Chart] Lines of code over time: adds MJD leaves / RG joins UEdin.

23 [Chart] Lines of code over time: adds Deep Transition RNN Cells.

24 [Chart] Lines of code over time: adds Working Transformer Model.

25 [Chart] Lines of code over time: adds v1.0.0.

26 [Chart] Lines of code over time: adds v1.1.0, Batched decoding.

27 [Chart] Lines of code over time: adds MJD joins Microsoft.

28 [Chart] Lines of code over time: adds v1.3.0.

29 [Chart] Lines of code over time: adds v1.4.0 (Data weighting, CPU backend, Restoration).

30 [Chart] Lines of code over time: adds v1.5.0 (16-bit CPU MM, Memoization, Autotuning).

31 [Chart] Lines of code over time: adds v1.6.0 (30% faster, NCCL, Tied layers).

32 Going further: reduce dependencies for the CPU version to zero; reduce dependencies for the GPU version to CUDA; become faster and more versatile; research tool with immediate deployment.

33 to 39 [Image-only slides]

40 Part II Decoding on the CPU

41 Quality first, speed later. Lessons from the WNMT shared task on efficient decoding; sequence-level knowledge distillation (Kim & Rush 2016): train four Transformer-big models on the official task data (teacher); translate the entire EN data to DE-trans (8-best list); select the sentences with the highest sentence-level BLEU against the DE-orig data; train students on the EN/DE-trans data.
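
A minimal sketch of the selection step above, in the spirit of the C++ used elsewhere in this deck. The container layout and the sentenceBLEU scorer passed in by the caller are assumptions for illustration, not part of Marian.

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// For every source sentence, keep the teacher hypothesis (from the 8-best
// list) that scores highest against the original DE reference. The student
// is then trained on (EN source, selected DE-trans target) pairs.
std::vector<std::string> selectDistillationTargets(
    const std::vector<std::vector<std::string>>& nbest,    // teacher 8-best per sentence
    const std::vector<std::string>& references,            // original DE-orig sentences
    const std::function<double(const std::string&, const std::string&)>& sentenceBLEU) {
  std::vector<std::string> targets;
  targets.reserve(nbest.size());
  for(size_t i = 0; i < nbest.size(); ++i) {
    size_t best = 0;
    double bestScore = -1.0;
    for(size_t j = 0; j < nbest[i].size(); ++j) {
      double score = sentenceBLEU(nbest[i][j], references[i]);
      if(score > bestScore) { bestScore = score; best = j; }
    }
    targets.push_back(nbest[i][best]);
  }
  return targets;
}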

42 [Table] Teacher: Transformer-big, model size in MiB (scale preserving).

43 [Table] Teacher (Transformer-big) vs. students (Transformer big / base / small, beam=1), model sizes in MiB (scale preserving).

44 [Chart] Seconds to translate newstest2014 (batch size: ca. 384 words, 5 to 25 sentences). Configurations: Transformer-base; +Shortlist; +AAN (modified); +MM-int16; +Memoization; +Auto-tuning; +Caching att keys/values; w/o --optimize. Time broken down into MM-MKL, MM-int16, Quant-int16 and Other.

45 Output layer: W·x + b over the full vocabulary V = 36,000 vs. the shortlisted W'·x + b' over V' = 578.
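
As a rough illustration of what the shortlist buys: the output logits only have to be computed for the candidate rows of W. The row-major layout and names below are assumptions, not Marian's actual API.

#include <cstddef>
#include <vector>

// Output-layer logits restricted to a shortlist of candidate vocabulary ids.
// W is stored row-major with one row of size `dim` per vocabulary entry
// (full vocabulary V = 36,000); `shortlist` holds the ~578 candidate ids.
std::vector<float> shortlistLogits(const std::vector<float>& hidden,   // size dim
                                   const std::vector<float>& W,        // size V * dim
                                   const std::vector<float>& b,        // size V
                                   const std::vector<int>& shortlist,
                                   size_t dim) {
  std::vector<float> logits(shortlist.size(), 0.f);
  for(size_t k = 0; k < shortlist.size(); ++k) {
    const float* row = &W[static_cast<size_t>(shortlist[k]) * dim];
    float sum = b[shortlist[k]];
    for(size_t d = 0; d < dim; ++d)
      sum += hidden[d] * row[d];
    logits[k] = sum;  // softmax is then taken over 578 entries instead of 36,000
  }
  return logits;
}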

46 [Chart] Timing chart from slide 44, repeated (seconds to translate newstest2014).

47 Multiplicative attention:
Q' = Q·W_q + b_q, K' = K·W_k + b_k, V' = V·W_v + b_v
C = softmax(Q'·K'^T)·V'
Y = norm(Q' + C)
During training: Q = K = V. During translation: Q ≠ K; K = V.
Complexity per step: O(n), because c_t = softmax(q_t·K'_{<t}^T)·V'_{<t} and K'_{<t+1} = [K'_{<t}; q_t].

48 Multiplicative attention:
Q' = Q·W_q + b_q, K' = K·W_k + b_k, V' = V·W_v + b_v
C = softmax(Q'·K'^T)·V'
Y = norm(Q' + C)
During training: Q = K = V. During translation: Q ≠ K; K = V.
Complexity per step: O(n), because c_t = softmax(q_t·K'_{<t}^T)·V'_{<t} and K'_{<t+1} = [K'_{<t}; q_t].

Average attention network (Zhang et al. 2018):
C = gate(ffn(avg(V)), Q')
Y = norm(Q' + C)
Gate and FFN optional.
Complexity per step: O(1), because avg_t = ((t-1)·avg_{t-1} + v_t) / t.
Basically a weird RNN. The authors report a 4x speed-up for beam=4.
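
A tiny sketch of why the average attention step is O(1): the cumulative average over the value vectors can be updated in place at each decoding step, independent of how many steps have been produced so far. Names are illustrative only.

#include <cstddef>
#include <vector>

// Incremental update of the cumulative average used by the average attention
// network: avg_t = ((t - 1) * avg_{t-1} + v_t) / t.
// One vector operation per decoding step.
void updateAverage(std::vector<float>& avg,        // avg_{t-1}, updated in place to avg_t
                   const std::vector<float>& v,    // current value vector v_t
                   size_t t) {                     // 1-based time step
  for(size_t i = 0; i < avg.size(); ++i)
    avg[i] = ((t - 1) * avg[i] + v[i]) / t;
}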

49 [Chart] Timing chart from slide 44, repeated (seconds to translate newstest2014).

50 Code based on Devlin (2017), extended to AVX512. Quantization: q(x) = int16(x · 2^10).

51 Code based on Devlin (2017), extended to AVX512. q(x) = int16(x · 2^10); the int16 product is A_q ∘ B_q = A_q · B_q^T.

52 Code based on Devlin (2017), extended to AVX512. q(x) = int16(x · 2^10); A_q ∘ B_q = A_q · B_q^T; since x·W = x·(W^T)^T, we compute q(x) ∘ q(W^T) = q(x) · q(W^T)^T.
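
A plain scalar sketch of the quantized product (no AVX512, no saturation handling), just to make the 2^10 scaling explicit; the real kernels follow Devlin (2017). Function names are illustrative.

#include <cmath>
#include <cstdint>
#include <vector>

// q(x) = int16(x * 2^10): quantize a float vector to 16-bit fixed point.
std::vector<int16_t> quantize(const std::vector<float>& x) {
  std::vector<int16_t> q(x.size());
  for(size_t i = 0; i < x.size(); ++i)
    q[i] = static_cast<int16_t>(std::lrint(x[i] * 1024.f));
  return q;
}

// dot16: dot product of a quantized activation row with a quantized,
// pre-transposed weight row. Accumulate in 32 bit (the real AVX512 kernels
// use saturating instructions) and rescale by 2^-20, since both operands
// carry a 2^10 scale.
float dot16(const std::vector<int16_t>& xq, const std::vector<int16_t>& wtq) {
  int32_t acc = 0;
  for(size_t i = 0; i < xq.size(); ++i)
    acc += static_cast<int32_t>(xq[i]) * static_cast<int32_t>(wtq[i]);
  return static_cast<float>(acc) / (1024.f * 1024.f);
}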

53 [Graph rewrite] dot(x, W) becomes dot16 over quantize(x) and quantize(transpose(W)).

54 [Chart] Timing chart from slide 44, repeated (seconds to translate newstest2014).

55 [Graph] dot(x, W) vs. dot16 over quantize(x) and quantize(transpose(W)) (repeat of slide 53).

56 [Graph] As on slide 53; W is annotated as a parameter.

57 [Graph] As on slide 53; W is a parameter, transpose(W) is annotated as constant.

58 [Graph] As on slide 53; W is a parameter, transpose(W) and quantize(transpose(W)) are annotated as constant.

59 [Graph] As on slide 53; the constant subgraph over W (transpose and quantize) is memoized.
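
A schematic of the memoization step, assuming a simple cache keyed by the parameter's address: quantize(transpose(W)) depends only on the parameter, so it is computed once and reused, while quantize(x) still has to be recomputed for every input. The cache structure and names are not Marian's.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Cache of quantized, transposed parameter matrices. The parameter does not
// change during translation, so the subgraph quantize(transpose(W)) is
// evaluated only on first use and then reused (memoized).
struct QuantizedParamCache {
  std::unordered_map<const void*, std::vector<int16_t>> cache;

  const std::vector<int16_t>& get(const std::vector<float>& W, size_t rows, size_t cols) {
    auto it = cache.find(&W);
    if(it != cache.end())
      return it->second;                       // reuse memoized result
    std::vector<int16_t> wtq(rows * cols);
    for(size_t r = 0; r < rows; ++r)           // transpose ...
      for(size_t c = 0; c < cols; ++c)         // ... and quantize with the 2^10 scale
        wtq[c * rows + r] = static_cast<int16_t>(W[r * cols + c] * 1024.f);
    return cache.emplace(&W, std::move(wtq)).first->second;
  }
};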

60 [Chart] Timing chart from slide 44, repeated (seconds to translate newstest2014).

61 [Graph] Repeat of slide 59: the constant subgraph over W is memoized.

62 [Graph] As on slide 59, with dot(x, W) and dot16(x, W) now wrapped in autotune(x, W).

63 h_1 = hash(dot(x, W)) = combine(hash(dot), hash(dims(x)), hash(dims(W))) = combine(hash(dot), hash({11, 512}), hash({512, 512}))
h_2 = hash(dot16(x, W)) = combine(hash(dot16), hash(dims(x)), hash(dims(W))) = combine(hash(dot16), hash({11, 512}), hash({512, 512}))
We can decrease granularity by integer-dividing the dimensions by a given factor; we choose 5.
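
One possible realization of these keys, combining a hash of the operation with hashes of the coarsened operand shapes. The divisor of 5 is from the slide; the mixing function and names are assumptions.

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Combine a hash value with another hashed field (boost-style mixing).
inline void hashCombine(size_t& seed, size_t value) {
  seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Key for one algorithm choice: the operation name plus the operand shapes,
// with each dimension integer-divided by 5 to reduce granularity, so that
// similar batch sizes fall into the same measurement bucket.
size_t tuneKey(const std::string& op,
               const std::vector<size_t>& dimsX,
               const std::vector<size_t>& dimsW) {
  size_t seed = std::hash<std::string>{}(op);   // hash(dot) or hash(dot16)
  for(size_t d : dimsX) hashCombine(seed, std::hash<size_t>{}(d / 5));
  for(size_t d : dimsW) hashCombine(seed, std::hash<size_t>{}(d / 5));
  return seed;
}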

64 [Graph] Repeat of slide 62: autotune(x, W) over dot(x, W) and the memoized dot16 branch.

65 [Graph] autotune(x, W) selects between dot(x, W), timed under measure(h_1), and dot16 over quantize(x) and the memoized quantize(transpose(W)), timed under measure(h_2).
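
A compact sketch of the measure-then-choose behaviour of autotune(x, W): time both alternatives under their hash keys for a number of calls, then keep running the faster one. The trial count and bookkeeping are assumptions for illustration.

#include <chrono>
#include <cstddef>
#include <functional>
#include <unordered_map>

// Per-key statistics: accumulated runtime and number of timed calls.
struct Stats { double seconds = 0.0; size_t calls = 0; };

// Choose between two implementations of the same operation. While either key
// has fewer than `trials` measurements, keep timing both alternately;
// afterwards always run the one with the lower average runtime.
struct AutoTuner {
  std::unordered_map<size_t, Stats> stats;
  size_t trials = 10;

  void run(size_t h1, const std::function<void()>& algo1,
           size_t h2, const std::function<void()>& algo2) {
    Stats& s1 = stats[h1];
    Stats& s2 = stats[h2];
    bool useFirst;
    if(s1.calls < trials || s2.calls < trials)
      useFirst = s1.calls <= s2.calls;                             // still measuring
    else
      useFirst = s1.seconds / s1.calls <= s2.seconds / s2.calls;   // tuned choice
    Stats& s = useFirst ? s1 : s2;
    auto start = std::chrono::steady_clock::now();
    (useFirst ? algo1 : algo2)();
    s.seconds += std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    s.calls += 1;
  }
};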

66 [Chart] Timing chart from slide 44, repeated (seconds to translate newstest2014).

67 [Chart] Timing chart from slide 44, repeated, with an additional --optimize bar.

68 [Chart] Timing chart from slide 44, repeated, with additional --optimize and Marian v bars.

69 [Chart] Latency per sentence in milliseconds for newstest2014 (batch size: 1 sentence), for the configurations from slide 44, with and without --optimize.

70 Map GPU and CPU performance into a comparable space [w/$]. newstest2014 (DE) consists of 62,954 tokens.
Instance prices [$/h]: p3.2xlarge 3.259; m5.large 0.102.
[w/$] = (62,954 [w] / Translation time [s]) × (3,600 [s/h] / Instance price [$/h])
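
The same conversion as a small helper, using the token count and unit conversion from the slide; the translation time and instance price are whatever a given submission measured, so the arguments here are placeholders.

// Source tokens translated per USD: tokens per second, scaled to an hour,
// divided by the hourly instance price.
double tokensPerDollar(double translationTimeSeconds, double pricePerHour) {
  double tokensPerHour = 62954.0 / translationTimeSeconds * 3600.0;
  return tokensPerHour / pricePerHour;
}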

71 [Chart] BLEU vs. million translated source tokens per USD (log scale): other GPU submissions.

72 [Chart] BLEU vs. million translated source tokens per USD (log scale): Marian GPU submissions vs. other GPU submissions.

73 [Chart] BLEU vs. million translated source tokens per USD (log scale): other CPU submissions.

74 [Chart] BLEU vs. million translated source tokens per USD (log scale): Marian CPU submissions vs. other CPU submissions.

75 [Chart] BLEU vs. million translated source tokens per USD (log scale): Marian GPU and CPU submissions vs. other GPU and CPU submissions.

76 [Chart] BLEU vs. million translated source tokens per USD (log scale): Marian v1.4 GPU and CPU vs. other GPU and CPU submissions.

77 [Chart] BLEU vs. million translated source tokens per USD (log scale): Marian v1.4 and v1.5, GPU and CPU, vs. other GPU and CPU submissions.

78 Future work: more experiments with the teacher-student scenario; more SIMD operations on the CPU; all operations in fixed-point arithmetic on the CPU; 8-bit matrix products on the CPU; mixed precision (FP16) on the GPU; optimize beam search for batched translation.

79 Translator
