Translator
Transcription
1 Translator
2 Marian Rejewski
3 A few words about Marian: portable C++ code with minimal dependencies (CUDA or MKL, and still Boost); single engine for training and decoding on GPU and CPU; custom auto-diff engine with dynamic graphs (similar to DyNet); optimized towards NMT and …
4 Part I: A Machine Translation Marathon 2016 project
5 The first commit. Commit: 6a7c93; Date: May 4th, 2016; Message: "very cool"; Lines: 155
6
#include <iostream>
#include "mad.h"

int main(int argc, char** argv) {
  Var x0 = 1, x1 = 2, x2 = 3;
  auto y = x0 + x0 + log(x2) + x1;

  std::vector<Var> x = { x0, x1, x2 };
  set_zero_all_adjoints();
  y.grad();

  std::cerr << "y = " << y.val() << std::endl;
  for (int i = 0; i < x.size(); ++i)
    std::cerr << "dy/dx_" << i << " = " << x[i].adj() << std::endl;
}
7
Var x0 = 1, x1 = 2, x2 = 3;
auto y = x0 + x0 + log(x2) + x1;

y = 5.09861
dy/dx_0 = 2
dy/dx_1 = 1
dy/dx_2 = 0.333333
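These values follow directly from the expression; as a quick check:

\[ y = x_0 + x_0 + \log x_2 + x_1 = 1 + 1 + \log 3 + 2 \approx 5.09861 \]
\[ \frac{\partial y}{\partial x_0} = 2, \qquad \frac{\partial y}{\partial x_1} = 1, \qquad \frac{\partial y}{\partial x_2} = \frac{1}{x_2} = \frac{1}{3} \approx 0.333333 \]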
8 [Chart, built up over slides 8-10: lines of code over time (y-axis 0 to 45,000); annotations: First commit, MTM.]
11 A Neural Network Toolkit for MT. Maximiliana Behnke, Tomasz Dwojak, Marcin Junczys-Dowmunt, Roman Grundkiewicz, André Martins, Hieu Hoang, Lane Schwartz
13 Why create another NN toolkit? Flexibility: add functionality more easily and faster. Speed: pure C++ implementation, GPU-enabled (CPU may come soon). Learn about deep learning: implement everything from scratch ourselves.
14 What we've achieved this week: a framework to create computation graphs; a simple feedforward NN; RNN, GRU, LSTM; binary or multiclass classifiers; a forward step (classify, given input data and weights); a backward step (learn weights using backpropagation); tested with small datasets: MNIST (digit image recognition task) and MT. Documentation: Doxygen.
15 [Diagram: an example computation graph; input, const, and param nodes feeding +, log, tanh, and softmax operations.]
16 What needs to be finished? Basic features: data shuffling; more random distributions; model serialization & deserialization; documentation.
17 [Chart, built up over slides 17-31: lines of code over time (y-axis 0 to 45,000). Annotations, in order of appearance: First commit; MTM 2016; Working Nematus model; AARA; Multi-GPU training; Marian decoder; MJD leaves / RG joins UEdin; Deep transition RNN cells; Working Transformer model; v1.0.0; v1.1.0 (batched decoding); MJD joins Microsoft; v1.3.0; v1.4.0 (data weighting, CPU backend, restoration); v1.5.0 (CPU 16-bit MM, memoization, auto-tuning); v1.6.0 (30% faster, NCCL, tied layers).]
32 Going further: reduce dependencies for the CPU version to zero; reduce dependencies for the GPU version to CUDA; become faster and more versatile; a research tool with immediate deployment.
40 Part II: Decoding on the CPU
41 Quality first, speed later. Lessons from the WNMT shared task on efficient decoding; sequence-level knowledge distillation (Kim & Rush 2016): train four Transformer-big models on the official task data (the teacher); translate the entire EN data to DE-trans (8-best lists); select the sentences with the highest sentence-level BLEU against the DE-orig data; train students on the EN/DE-trans data (the selection step is sketched below).
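A minimal sketch of that selection step, assuming whitespace tokenization and a generic add-one-smoothed sentence-BLEU (illustrative only, not the task's official scorer):

#include <cmath>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

static std::vector<std::string> tokenize(const std::string& s) {
  std::istringstream in(s);
  std::vector<std::string> toks;
  for (std::string t; in >> t;) toks.push_back(t);
  return toks;
}

// Add-one smoothed BLEU up to 4-grams, with brevity penalty.
static double sentenceBleu(const std::vector<std::string>& hyp,
                           const std::vector<std::string>& ref) {
  if (hyp.empty()) return 0.0;
  double logScore = 0.0;
  for (std::size_t n = 1; n <= 4; ++n) {
    std::map<std::vector<std::string>, int> refCounts;
    for (std::size_t i = 0; i + n <= ref.size(); ++i)
      ++refCounts[{ref.begin() + i, ref.begin() + i + n}];
    int match = 0, total = 0;
    for (std::size_t i = 0; i + n <= hyp.size(); ++i) {
      auto it = refCounts.find({hyp.begin() + i, hyp.begin() + i + n});
      if (it != refCounts.end() && it->second > 0) { --it->second; ++match; }
      ++total;
    }
    logScore += std::log((match + 1.0) / (total + 1.0)) / 4.0;
  }
  double bp = hyp.size() < ref.size()
                  ? std::exp(1.0 - double(ref.size()) / double(hyp.size()))
                  : 1.0;
  return bp * std::exp(logScore);
}

// Pick the teacher hypothesis closest to the original reference.
std::string selectTarget(const std::vector<std::string>& nbest,
                         const std::string& reference) {
  std::vector<std::string> ref = tokenize(reference);
  std::string best;
  double bestScore = -1.0;
  for (const std::string& h : nbest) {
    double s = sentenceBleu(tokenize(h), ref);
    if (s > bestScore) { bestScore = s; best = h; }
  }
  return best;
}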
42 [Table, slides 42-43: model sizes in MiB, scale preserving. Teacher: Transformer Big. Students (beam=1): Transformer Big, Base, Small. One legible value: 264 MiB.]
44 [Stacked bar chart: seconds to translate newstest2014 (batch size: ca. 384 words, 5 to 25 sentences), with and without --optimize. Rows: Transformer-base; Shortlist; +AAN (modified); +MM-int16; +Memoization; +Auto-tuning; +Caching att keys/values. Time is broken down into MM-MKL, MM-int16, Quant-int16, and Other.]
45 [Diagram: the output projection W x + b over the full vocabulary (V = 36,000) is replaced by W' x + b' over a shortlist (V' = 578).]
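A minimal sketch of what the shortlist buys, under assumed row-major shapes (not Marian's API): gather only the shortlisted rows of the output matrix and compute just those logits.

#include <cstddef>
#include <vector>

// Logits over the shortlisted vocabulary only. W is V x d row-major (one row
// per output word), b the output bias (length V), h the decoder state (d).
// Cost drops from O(V * d) with V = 36,000 to O(V' * d) with V' = 578.
std::vector<float> shortlistLogits(const std::vector<float>& W,
                                   const std::vector<float>& b,
                                   const std::vector<float>& h,
                                   const std::vector<int>& shortlist,
                                   std::size_t d) {
  std::vector<float> logits(shortlist.size());
  for (std::size_t j = 0; j < shortlist.size(); ++j) {
    const float* row = &W[std::size_t(shortlist[j]) * d];
    float acc = b[std::size_t(shortlist[j])];
    for (std::size_t k = 0; k < d; ++k) acc += row[k] * h[k];
    logits[j] = acc;
  }
  return logits;
}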
46 [The timing chart from slide 44, revisited after the shortlist step.]
47 Multiplicative attention:
Q' = Q W_q + b_q
K' = K W_k + b_k
V' = V W_v + b_v
C = softmax(Q' (K')^T) V'
Y = norm(Q' + C)
During training: Q = K = V. During translation: Q ≠ K; K = V.
Complexity per step: O(n), because c_t = softmax(q'_t (K'_{<t})^T) V'_{<t} and K'_{<t+1} = [K'_{<t}; q'_t].
48 (Slide 47 repeated.) Average attention network (Zhang et al. 2018):
C = gate(ffn(V̄), Q)
Y = norm(Q + C)
Gate and FFN optional. Complexity per step: O(1), because v̄_t = ((t - 1) v̄_{t-1} + v_t) / t. Basically a weird RNN. Authors report a 4x speed-up for beam=4. (The O(1) update is sketched below.)
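A minimal sketch of that O(1) update (illustrative, not Marian's code): the decoder keeps one running average per layer and updates it in place each step, instead of attending over all previous positions.

#include <cstddef>
#include <vector>

// Decoder-side state for the average attention network: one running average
// over the value vectors seen so far, plus the step counter t.
struct AanState {
  std::vector<float> avg;  // \bar v_{t-1}
  int t = 0;
};

// O(1) per step: \bar v_t = ((t - 1) * \bar v_{t-1} + v_t) / t.
void aanStep(AanState& s, const std::vector<float>& v_t) {
  if (s.avg.empty()) s.avg.assign(v_t.size(), 0.0f);
  ++s.t;
  for (std::size_t i = 0; i < v_t.size(); ++i)
    s.avg[i] = ((s.t - 1) * s.avg[i] + v_t[i]) / s.t;
}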
49 [The timing chart, revisited after the modified AAN step.]
50 Code based on Devlin (2017), extended to AVX512. Quantization: q(x) = int16(x · 2^10).
51 The int16 kernel takes its second argument transposed: A_q ⊗ B_q = A_q B_q^T.
52 Since x W = x (W^T)^T, feeding the kernel q(x) and q(W^T) yields q(x) ⊗ q(W^T) = q(x) q(W^T)^T, i.e. the quantized x W.
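A scalar sketch of that arithmetic, using the 2^10 scale from the slide (the real kernels vectorize this with AVX512; the int64 accumulator here is just for safety):

#include <cstddef>
#include <cstdint>
#include <vector>

// q(x) = int16(x * 2^10), saturating at the int16 range.
int16_t quantize(float x) {
  float v = x * 1024.0f;
  if (v > 32767.0f) v = 32767.0f;
  if (v < -32768.0f) v = -32768.0f;
  return static_cast<int16_t>(v);
}

// Dot product of two quantized rows; dividing by 2^20 undoes both 2^10 scales.
float dot16(const std::vector<int16_t>& a, const std::vector<int16_t>& b) {
  int64_t acc = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
    acc += int32_t(a[i]) * int32_t(b[i]);
  return float(acc) / (1024.0f * 1024.0f);
}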
53 [Diagram: dot(x, W) reads x and W directly; dot16(x, W) reads quantize(x) and quantize(transpose(W)).]
54 [The timing chart, revisited after the MM-int16 step.]
55 [The slide-53 diagram, annotated step by step over slides 55-59: W is a parameter and x a constant; quantize(x) must be recomputed per input, while quantize(transpose(W)) depends only on the parameter and is marked memoized. A sketch of the cache follows.]
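A minimal sketch of that memoization, with illustrative names (not Marian's API; saturation omitted): the quantized, transposed copy of a parameter is computed on first use and cached against the parameter's storage.

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using QMatrix = std::vector<int16_t>;

// Quantize W (rows x cols, row-major) into its transpose with the 2^10 scale.
QMatrix quantizeTranspose(const std::vector<float>& W, int rows, int cols) {
  QMatrix q(W.size());
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
      q[std::size_t(c) * rows + r] =
          static_cast<int16_t>(W[std::size_t(r) * cols + c] * 1024.0f);
  return q;
}

// W is a parameter: it never changes during translation, so its quantized
// transpose is computed once and then looked up by the parameter's storage.
// quantize(x), by contrast, is recomputed for every new input x.
const QMatrix& memoizedWeight(const std::vector<float>& W, int rows, int cols) {
  static std::unordered_map<const float*, QMatrix> cache;
  auto it = cache.find(W.data());
  if (it == cache.end())
    it = cache.emplace(W.data(), quantizeTranspose(W, rows, cols)).first;
  return it->second;
}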
60 [The timing chart, revisited after the memoization step.]
61 [The slide-59 diagram, extended on slide 62: dot(x, W) and dot16(x, W) are wrapped in an autotune(x, W) node that chooses between them.]
63 h_1 = hash(dot(x, W)) = hash(dot) ∘ hash(dims(x)) ∘ hash(dims(W)) = hash(dot) ∘ hash({11, 512}) ∘ hash({512, 512})
h_2 = hash(dot16(x, W)) = hash(dot16) ∘ hash(dims(x)) ∘ hash(dims(W)) = hash(dot16) ∘ hash({11, 512}) ∘ hash({512, 512})
(∘ denotes hash combination.) We can decrease granularity by integer-dividing the dimensions by a given factor; we choose 5. (A sketch of such a key follows.)
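A sketch of such a key, with an assumed boost-style combiner (the slide does not show how the hashes are combined):

#include <cstddef>
#include <functional>
#include <vector>

// Boost-style hash combiner (an assumption, not shown on the slide).
static std::size_t hashCombine(std::size_t seed, std::size_t h) {
  return seed ^ (h + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// Key = hash(op) combined with hashes of the operand shapes; integer-dividing
// each dimension by `factor` (5 on the slide) coarsens the granularity so that
// nearby shapes, e.g. batch sizes 11 and 13, share one measurement bucket.
std::size_t tuneKey(std::size_t opId, const std::vector<int>& dimsX,
                    const std::vector<int>& dimsW, int factor = 5) {
  std::size_t h = opId;
  for (int d : dimsX) h = hashCombine(h, std::hash<int>{}(d / factor));
  for (int d : dimsW) h = hashCombine(h, std::hash<int>{}(d / factor));
  return h;
}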
64 [The autotune diagram again, expanded on slide 65: autotune(x, W) dispatches between dot(x, W) and dot16(x, W), timing each candidate under its key via measure(h_1) and measure(h_2); the measurements are memoized.]
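A minimal sketch of that dispatch logic; the measurement details (warm-up count, plain averaging) are assumptions, not the talk's:

#include <chrono>
#include <cstddef>
#include <functional>
#include <unordered_map>

struct Stats {
  double totalMs = 0.0;
  int runs = 0;
  double avg() const { return runs ? totalMs / runs : 0.0; }
};

// Time one call of f and record it under its stats entry.
static void timedRun(const std::function<void()>& f, Stats& s) {
  auto t0 = std::chrono::steady_clock::now();
  f();
  s.totalMs += std::chrono::duration<double, std::milli>(
                   std::chrono::steady_clock::now() - t0).count();
  ++s.runs;
}

// Measure both dot variants for a few calls per hash key, then always take
// the one with the lower average time for that key.
void autotuneDot(std::size_t h1, std::size_t h2,
                 const std::function<void()>& dotF,
                 const std::function<void()>& dot16F,
                 int warmupRuns = 10) {
  static std::unordered_map<std::size_t, Stats> stats;
  Stats& s1 = stats[h1];
  Stats& s2 = stats[h2];
  if (s1.runs < warmupRuns)      timedRun(dotF, s1);
  else if (s2.runs < warmupRuns) timedRun(dot16F, s2);
  else if (s1.avg() <= s2.avg()) dotF();
  else                           dot16F();
}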
66 [The timing chart, revisited over slides 66-68: auto-tuning and caching of attention keys/values applied; final --optimize vs. w/o --optimize comparison, labeled with the Marian version (number truncated).]
69 [Bar chart: latency per sentence in milliseconds for newstest2014 (batch size: 1 sentence), using --optimize vs. w/o --optimize, for Transformer-base, Shortlist, AAN (modified), and caching of attention keys/values; Marian version label truncated. One legible value: 2654.]
70 Map GPU and CPU performance into a comparable space, [w/$]. newstest2014.de consists of 62,954 tokens. Instance prices: p3.2xlarge at 3.259 $/h, m5.large at 0.102 $/h.
[w/$] = (62,954 [w] / Translation time [s]) × (3,600 [s/h] / Instance price [$/h])
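As a worked example with a hypothetical translation time of 100 s on m5.large (the actual measured times are behind the plots that follow):

\[ [\mathrm{w}/\$] = \frac{62{,}954\ \mathrm{w}}{100\ \mathrm{s}} \cdot \frac{3{,}600\ \mathrm{s/h}}{0.102\ \$/\mathrm{h}} \approx 2.2 \times 10^{7}\ \mathrm{w}/\$ \]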
71 [Scatter plots, built up over slides 71-77: BLEU vs. million translated source tokens per USD (log scale). Panels compare Marian v1.4 and v1.5 GPU submissions against other GPU submissions, and Marian v1.4 and v1.5 CPU submissions against other CPU submissions; individual submission labels are illegible.]
78 Future work: more experiments with the teacher-student scenario; more SIMD operations on the CPU; all operations in fixed-point arithmetic on the CPU; 8-bit matrix products on the CPU; mixed precision (FP16) on the GPU; optimize beam search for batched translation.
79 Translator