Translator

1 Translator

2 Marian Rejewski

3 A few words about Marian: portable C++ code with minimal dependencies (CUDA or MKL, and still Boost); single engine for training and decoding on GPU and CPU; custom auto-diff engine with dynamic graphs (similar to DyNet); optimized towards NMT and ...

4 Part I A Machine Translation Marathon 2016 project

5 The first commit. Commit: 6a7c93; Date: May 4th, 2016; Message: "very cool"; Lines: 155.

6 #include <iostream>
#include <vector>
#include "mad.h"

int main(int argc, char** argv) {
  Var x0 = 1, x1 = 2, x2 = 3;
  auto y = x0 + x0 + log(x2) + x1;

  std::vector<Var> x = { x0, x1, x2 };

  set_zero_all_adjoints();
  y.grad();

  std::cerr << "y = " << y.val() << std::endl;
  for(int i = 0; i < x.size(); ++i)
    std::cerr << " dy/dx_" << i << " = " << x[i].adj() << std::endl;
}

7 Var x0 = 1, x1 = 2, x2 = 3;
auto y = x0 + x0 + log(x2) + x1;

y = 5.09861
dy/dx_0 = 2
dy/dx_1 = 1
dy/dx_2 = 0.333333

8 [Chart] Lines of code over time.

9 [Chart] Lines of code over time: First commit.

10 [Chart] Lines of code over time: First commit; MTM.

11 A Neural Network Toolkit for MT. Maximiliana Behnke, Tomasz Dwojak, Marcin Junczys-Dowmunt, Roman Grundkiewicz, Andre Martins, Hieu Hoang, Lane Schwartz.

12 A Neural Network Toolkit for MT. Maximiliana Behnke, Tomasz Dwojak, Marcin Junczys-Dowmunt, Roman Grundkiewicz, Andre Martins, Hieu Hoang, Lane Schwartz.

13 Why create another NN toolkit?
Flexibility: add functionality easier & faster.
Speed: pure C++ implementation, GPU-enabled (CPU may come soon).
Learn about Deep Learning: implement everything from scratch by ourselves.

14 What we've achieved this week?
Framework to create computation graphs; simple feedforward NN; RNN, GRU, LSTM; binary or multiclass classifier.
Forward step: classify, given input data and weights.
Backward step: learn weights using backpropagation.
Tested with small datasets: MNIST (digit image recognition task), MT.
Documentation: Doxygen.

15 [Computation graph diagram: input, parameter and constant nodes feeding +, tanh, log and softmax operations]

16 What needs to be finished? Basic features: data shuffling; more random distributions; model serialization & deserialization; documentation.

17 [Chart] Lines of code over time: First commit; MTM.

18 [Chart] Lines of code over time: adds Working Nematus Model.

19 [Chart] Lines of code over time: adds AARA.

20 [Chart] Lines of code over time: adds Multi-gpu Training (MTM 2016).

21 [Chart] Lines of code over time: adds Marian decoder.

22 [Chart] Lines of code over time: adds MJD leaves / RG joins UEdin.

23 [Chart] Lines of code over time: adds Deep Transition RNN Cells.

24 [Chart] Lines of code over time: adds Working Transformer Model.

25 [Chart] Lines of code over time: adds v1.0.0.

26 [Chart] Lines of code over time: adds v1.1.0, Batched decoding.

27 [Chart] Lines of code over time: adds MJD joins Microsoft.

28 [Chart] Lines of code over time: adds v1.3.0.

29 [Chart] Lines of code over time: adds v1.4.0 (Data weighting, CPU backend, Restoration).

30 [Chart] Lines of code over time: adds v1.5.0 (16-bit CPU MM, Memoization, Autotuning).

31 [Chart] Lines of code over time: adds v1.6.0 (30% faster, NCCL, Tied layers).

32 Going further: reduce dependencies for the CPU version to zero; reduce dependencies for the GPU version to CUDA; become faster and more versatile; research tool with immediate deployment.

33 to 39 [Image-only slides]

40 Part II Decoding on the CPU

41 Quality first, speed later. Lessons from the WNMT shared task on efficient decoding; sequence-level knowledge distillation (Kim & Rush 2016): train four Transformer-big models on the official task data (teacher); translate the entire EN data to DE-trans (8-best list); select the sentences with the highest sentence-level BLEU against the DE-orig data; train students on the EN/DE-trans data.
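
A minimal sketch of the selection step above, in the spirit of the C++ used elsewhere in this deck. The container layout and the sentenceBLEU scorer passed in by the caller are assumptions for illustration, not part of Marian.

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// For every source sentence, keep the teacher hypothesis (from the 8-best
// list) that scores highest against the original DE reference. The student
// is then trained on (EN source, selected DE-trans target) pairs.
std::vector<std::string> selectDistillationTargets(
    const std::vector<std::vector<std::string>>& nbest,    // teacher 8-best per sentence
    const std::vector<std::string>& references,            // original DE-orig sentences
    const std::function<double(const std::string&, const std::string&)>& sentenceBLEU) {
  std::vector<std::string> targets;
  targets.reserve(nbest.size());
  for(size_t i = 0; i < nbest.size(); ++i) {
    size_t best = 0;
    double bestScore = -1.0;
    for(size_t j = 0; j < nbest[i].size(); ++j) {
      double score = sentenceBLEU(nbest[i][j], references[i]);
      if(score > bestScore) { bestScore = score; best = j; }
    }
    targets.push_back(nbest[i][best]);
  }
  return targets;
}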

42 [Table] Teacher: Transformer-big, model size in MiB (scale preserving).

43 [Table] Teacher (Transformer-big) vs. students (Transformer big / base / small, beam=1), model sizes in MiB (scale preserving).

44 [Chart] Seconds to translate newstest2014 (batch size: ca. 384 words, 5 to 25 sentences). Configurations: Transformer-base; +Shortlist; +AAN (modified); +MM-int16; +Memoization; +Auto-tuning; +Caching att keys/values; w/o --optimize. Time broken down into MM-MKL, MM-int16, Quant-int16 and Other.

45 Output layer: W·x + b over the full vocabulary V = 36,000 vs. the shortlisted W'·x + b' over V' = 578.
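
As a rough illustration of what the shortlist buys: the output logits only have to be computed for the candidate rows of W. The row-major layout and names below are assumptions, not Marian's actual API.

#include <cstddef>
#include <vector>

// Output-layer logits restricted to a shortlist of candidate vocabulary ids.
// W is stored row-major with one row of size `dim` per vocabulary entry
// (full vocabulary V = 36,000); `shortlist` holds the ~578 candidate ids.
std::vector<float> shortlistLogits(const std::vector<float>& hidden,   // size dim
                                   const std::vector<float>& W,        // size V * dim
                                   const std::vector<float>& b,        // size V
                                   const std::vector<int>& shortlist,
                                   size_t dim) {
  std::vector<float> logits(shortlist.size(), 0.f);
  for(size_t k = 0; k < shortlist.size(); ++k) {
    const float* row = &W[static_cast<size_t>(shortlist[k]) * dim];
    float sum = b[shortlist[k]];
    for(size_t d = 0; d < dim; ++d)
      sum += hidden[d] * row[d];
    logits[k] = sum;  // softmax is then taken over 578 entries instead of 36,000
  }
  return logits;
}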

46 [Chart] Timing chart from slide 44, repeated (seconds to translate newstest2014).

47 Multiplicative attention:
Q' = Q·W_q + b_q, K' = K·W_k + b_k, V' = V·W_v + b_v
C = softmax(Q'·K'^T)·V'
Y = norm(Q' + C)
During training: Q = K = V. During translation: Q ≠ K; K = V.
Complexity per step: O(n), because c_t = softmax(q_t·K'_{<t}^T)·V'_{<t} and K'_{<t+1} = [K'_{<t}; q_t].

48 Multiplicative attention:
Q' = Q·W_q + b_q, K' = K·W_k + b_k, V' = V·W_v + b_v
C = softmax(Q'·K'^T)·V'
Y = norm(Q' + C)
During training: Q = K = V. During translation: Q ≠ K; K = V.
Complexity per step: O(n), because c_t = softmax(q_t·K'_{<t}^T)·V'_{<t} and K'_{<t+1} = [K'_{<t}; q_t].

Average attention network (Zhang et al. 2018):
C = gate(ffn(avg(V)), Q')
Y = norm(Q' + C)
Gate and FFN optional.
Complexity per step: O(1), because avg_t = ((t-1)·avg_{t-1} + v_t) / t.
Basically a weird RNN. The authors report a 4x speed-up for beam=4.
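
A tiny sketch of why the average attention step is O(1): the cumulative average over the value vectors can be updated in place at each decoding step, independent of how many steps have been produced so far. Names are illustrative only.

#include <cstddef>
#include <vector>

// Incremental update of the cumulative average used by the average attention
// network: avg_t = ((t - 1) * avg_{t-1} + v_t) / t.
// One vector operation per decoding step.
void updateAverage(std::vector<float>& avg,        // avg_{t-1}, updated in place to avg_t
                   const std::vector<float>& v,    // current value vector v_t
                   size_t t) {                     // 1-based time step
  for(size_t i = 0; i < avg.size(); ++i)
    avg[i] = ((t - 1) * avg[i] + v[i]) / t;
}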

49 [Chart] Timing chart from slide 44, repeated (seconds to translate newstest2014).

50 Code based on Devlin (2017), extended to AVX512. Quantization: q(x) = int16(x · 2^10).

51 Code based on Devlin (2017), extended to AVX512. q(x) = int16(x · 2^10); the int16 product is A_q ∘ B_q = A_q · B_q^T.

52 Code based on Devlin (2017), extended to AVX512. q(x) = int16(x · 2^10); A_q ∘ B_q = A_q · B_q^T; since x·W = x·(W^T)^T, we compute q(x) ∘ q(W^T) = q(x) · q(W^T)^T.
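
A plain scalar sketch of the quantized product (no AVX512, no saturation handling), just to make the 2^10 scaling explicit; the real kernels follow Devlin (2017). Function names are illustrative.

#include <cmath>
#include <cstdint>
#include <vector>

// q(x) = int16(x * 2^10): quantize a float vector to 16-bit fixed point.
std::vector<int16_t> quantize(const std::vector<float>& x) {
  std::vector<int16_t> q(x.size());
  for(size_t i = 0; i < x.size(); ++i)
    q[i] = static_cast<int16_t>(std::lrint(x[i] * 1024.f));
  return q;
}

// dot16: dot product of a quantized activation row with a quantized,
// pre-transposed weight row. Accumulate in 32 bit (the real AVX512 kernels
// use saturating instructions) and rescale by 2^-20, since both operands
// carry a 2^10 scale.
float dot16(const std::vector<int16_t>& xq, const std::vector<int16_t>& wtq) {
  int32_t acc = 0;
  for(size_t i = 0; i < xq.size(); ++i)
    acc += static_cast<int32_t>(xq[i]) * static_cast<int32_t>(wtq[i]);
  return static_cast<float>(acc) / (1024.f * 1024.f);
}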

53 [Graph rewrite] dot(x, W) becomes dot16 over quantize(x) and quantize(transpose(W)).

54 [Chart] Timing chart from slide 44, repeated (seconds to translate newstest2014).

55 [Graph] dot(x, W) vs. dot16 over quantize(x) and quantize(transpose(W)) (repeat of slide 53).

56 [Graph] As on slide 53; W is annotated as a parameter.

57 [Graph] As on slide 53; W is a parameter, transpose(W) is annotated as constant.

58 [Graph] As on slide 53; W is a parameter, transpose(W) and quantize(transpose(W)) are annotated as constant.

59 [Graph] As on slide 53; the constant subgraph over W (transpose and quantize) is memoized.
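
A schematic of the memoization step, assuming a simple cache keyed by the parameter's address: quantize(transpose(W)) depends only on the parameter, so it is computed once and reused, while quantize(x) still has to be recomputed for every input. The cache structure and names are not Marian's.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Cache of quantized, transposed parameter matrices. The parameter does not
// change during translation, so the subgraph quantize(transpose(W)) is
// evaluated only on first use and then reused (memoized).
struct QuantizedParamCache {
  std::unordered_map<const void*, std::vector<int16_t>> cache;

  const std::vector<int16_t>& get(const std::vector<float>& W, size_t rows, size_t cols) {
    auto it = cache.find(&W);
    if(it != cache.end())
      return it->second;                       // reuse memoized result
    std::vector<int16_t> wtq(rows * cols);
    for(size_t r = 0; r < rows; ++r)           // transpose ...
      for(size_t c = 0; c < cols; ++c)         // ... and quantize with the 2^10 scale
        wtq[c * rows + r] = static_cast<int16_t>(W[r * cols + c] * 1024.f);
    return cache.emplace(&W, std::move(wtq)).first->second;
  }
};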

60 [Chart] Timing chart from slide 44, repeated (seconds to translate newstest2014).

61 [Graph] Repeat of slide 59: the constant subgraph over W is memoized.

62 [Graph] As on slide 59, with dot(x, W) and dot16(x, W) now wrapped in autotune(x, W).

63 h_1 = hash(dot(x, W)) = combine(hash(dot), hash(dims(x)), hash(dims(W))) = combine(hash(dot), hash({11, 512}), hash({512, 512}))
h_2 = hash(dot16(x, W)) = combine(hash(dot16), hash(dims(x)), hash(dims(W))) = combine(hash(dot16), hash({11, 512}), hash({512, 512}))
We can decrease granularity by integer-dividing the dimensions by a given factor; we choose 5.
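
One possible realization of these keys, combining a hash of the operation with hashes of the coarsened operand shapes. The divisor of 5 is from the slide; the mixing function and names are assumptions.

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Combine a hash value with another hashed field (boost-style mixing).
inline void hashCombine(size_t& seed, size_t value) {
  seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Key for one algorithm choice: the operation name plus the operand shapes,
// with each dimension integer-divided by 5 to reduce granularity, so that
// similar batch sizes fall into the same measurement bucket.
size_t tuneKey(const std::string& op,
               const std::vector<size_t>& dimsX,
               const std::vector<size_t>& dimsW) {
  size_t seed = std::hash<std::string>{}(op);   // hash(dot) or hash(dot16)
  for(size_t d : dimsX) hashCombine(seed, std::hash<size_t>{}(d / 5));
  for(size_t d : dimsW) hashCombine(seed, std::hash<size_t>{}(d / 5));
  return seed;
}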

64 [Graph] Repeat of slide 62: autotune(x, W) over dot(x, W) and the memoized dot16 branch.

65 [Graph] autotune(x, W) selects between dot(x, W), timed under measure(h_1), and dot16 over quantize(x) and the memoized quantize(transpose(W)), timed under measure(h_2).
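
A compact sketch of the measure-then-choose behaviour of autotune(x, W): time both alternatives under their hash keys for a number of calls, then keep running the faster one. The trial count and bookkeeping are assumptions for illustration.

#include <chrono>
#include <cstddef>
#include <functional>
#include <unordered_map>

// Per-key statistics: accumulated runtime and number of timed calls.
struct Stats { double seconds = 0.0; size_t calls = 0; };

// Choose between two implementations of the same operation. While either key
// has fewer than `trials` measurements, keep timing both alternately;
// afterwards always run the one with the lower average runtime.
struct AutoTuner {
  std::unordered_map<size_t, Stats> stats;
  size_t trials = 10;

  void run(size_t h1, const std::function<void()>& algo1,
           size_t h2, const std::function<void()>& algo2) {
    Stats& s1 = stats[h1];
    Stats& s2 = stats[h2];
    bool useFirst;
    if(s1.calls < trials || s2.calls < trials)
      useFirst = s1.calls <= s2.calls;                             // still measuring
    else
      useFirst = s1.seconds / s1.calls <= s2.seconds / s2.calls;   // tuned choice
    Stats& s = useFirst ? s1 : s2;
    auto start = std::chrono::steady_clock::now();
    (useFirst ? algo1 : algo2)();
    s.seconds += std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    s.calls += 1;
  }
};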

66 [Chart] Timing chart from slide 44, repeated (seconds to translate newstest2014).

67 [Chart] Timing chart from slide 44, repeated, with an additional --optimize bar.

68 [Chart] Timing chart from slide 44, repeated, with additional --optimize and Marian v bars.

69 [Chart] Latency per sentence in milliseconds for newstest2014 (batch size: 1 sentence), for the configurations from slide 44, with and without --optimize.

70 Map GPU and CPU performance into a comparable space [w/$]. newstest2014 (DE) consists of 62,954 tokens.
Instance prices [$/h]: p3.2xlarge 3.259; m5.large 0.102.
[w/$] = (62,954 [w] / Translation time [s]) × (3,600 [s/h] / Instance price [$/h])
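
The same conversion as a small helper, using the token count and unit conversion from the slide; the translation time and instance price are whatever a given submission measured, so the arguments here are placeholders.

// Source tokens translated per USD: tokens per second, scaled to an hour,
// divided by the hourly instance price.
double tokensPerDollar(double translationTimeSeconds, double pricePerHour) {
  double tokensPerHour = 62954.0 / translationTimeSeconds * 3600.0;
  return tokensPerHour / pricePerHour;
}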

71 [Chart] BLEU vs. million translated source tokens per USD (log scale): other GPU submissions.

72 [Chart] BLEU vs. million translated source tokens per USD (log scale): Marian GPU submissions vs. other GPU submissions.

73 [Chart] BLEU vs. million translated source tokens per USD (log scale): other CPU submissions.

74 [Chart] BLEU vs. million translated source tokens per USD (log scale): Marian CPU submissions vs. other CPU submissions.

75 [Chart] BLEU vs. million translated source tokens per USD (log scale): Marian GPU and CPU submissions vs. other GPU and CPU submissions.

76 [Chart] BLEU vs. million translated source tokens per USD (log scale): Marian v1.4 GPU and CPU vs. other GPU and CPU submissions.

77 [Chart] BLEU vs. million translated source tokens per USD (log scale): Marian v1.4 and v1.5, GPU and CPU, vs. other GPU and CPU submissions.

78 Future work: more experiments with the teacher-student scenario; more SIMD operations on the CPU; all operations in fixed-point arithmetic on the CPU; 8-bit matrix products on the CPU; mixed precision (FP16) on the GPU; optimize beam search for batched translation.

79 Translator
