Deep Learning Tutorial. 李宏毅 Hung-yi Lee


1 Deep Learning Tutorial 李宏毅 Hung-yi Lee

2 Deep learning attracts lots of attention (Google Trends). Deep learning obtains many exciting results; see the talks this afternoon. This talk will focus on the technical part.

3 Outline Part I: Introduction of Deep Learning Part II: Why Deep? Part III: Tips for Training Deep Neural Network Part IV: Neural Network with Memory

4 Part I: Introduction of Deep Learning What people already knew in 1980s

5 Example Application: Handwriting Digit Recognition. The machine takes an image of a handwritten digit as input and outputs the digit, e.g. "2".

6 Handwriting Digit Recognition. Input: the image is represented as a 16 x 16 = 256-dimensional vector x_1, x_2, ..., x_256 (ink → 1, no ink → 0). Output: a 10-dimensional vector y_1, ..., y_10, where each dimension represents the confidence of a digit (y_1 for "1", y_2 for "2", ..., y_10 for "0"). For the example image, the answer is "2".

7 Example Application: Handwriting Digit Recognition. The machine is a function f: R^256 → R^10 that maps the input vector x_1, ..., x_256 to the output y_1, ..., y_10. In deep learning, the function f is represented by a neural network.

8 Element of Neural Network. A neuron is a function f: R^K → R: z = a_1 w_1 + a_2 w_2 + ... + a_K w_K + b and a = σ(z), where w_1, ..., w_K are the weights, b is the bias, and σ(z) is the activation function.

9 Neural Network. Neurons are connected layer by layer: Input Layer (x_1, x_2, ..., x_N) → Layer 1 → Layer 2 → ... → Layer L → Output Layer (y_1, y_2, ..., y_M). The layers between input and output are the hidden layers. "Deep" means many hidden layers.

10 Example of Neural Network. The activation function is the sigmoid function σ(z) = 1 / (1 + e^{-z}).

11 Example of Neural Network

12 Example of Neural Network. The whole network defines a function f: R^2 → R^2: feeding in a 2-dimensional input vector produces a 2-dimensional output vector (the slide evaluates f on two example inputs). Different parameters define different functions.

13 Matrix Operation. The computation of a layer can be written as a matrix operation, y = σ(Wx + b) (worked numerical example on the slide).

14 Neural Network. With weight matrices W^1, W^2, ..., W^L and bias vectors b^1, b^2, ..., b^L, the layers compute a^1 = σ(W^1 x + b^1), a^2 = σ(W^2 a^1 + b^2), ..., y = σ(W^L a^{L-1} + b^L).

15 Neural Network. The whole network is therefore y = f(x) = σ(W^L ... σ(W^2 σ(W^1 x + b^1) + b^2) ... + b^L). Since every layer is a matrix operation, parallel computing techniques can be used to speed it up.
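
A minimal NumPy sketch of this layer-by-layer computation (not code from the slides; the layer sizes and the use of sigmoid in every layer are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    # Elementwise sigmoid activation: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # weights = [W1, ..., WL], biases = [b1, ..., bL]
    # Computes y = sigma(WL ... sigma(W2 sigma(W1 x + b1) + b2) ... + bL)
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Toy example: 256-dim input, two hidden layers of 500 neurons, 10 outputs
rng = np.random.default_rng(0)
sizes = [256, 500, 500, 10]
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
x = rng.random(256)                       # a fake "image" vector
print(forward(x, weights, biases).shape)  # (10,)
```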

16 Softmax. Use a softmax layer as the output layer. With an ordinary output layer, y_i = σ(z_i); in general the outputs of the network can be any value and may not be easy to interpret.

17 Softmax. With a softmax layer, the outputs are probabilities: y_i = e^{z_i} / Σ_{j=1}^{3} e^{z_j}, so 1 > y_i > 0 and Σ_i y_i = 1. Example from the slide: z = (3, 1, -3) gives e^{z_1} ≈ 20 and y ≈ (0.88, 0.12, ≈0).
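
A small sketch of the softmax computation, reproducing the example values from the slide (subtracting the max is a standard numerical-stability trick, not part of the slide):

```python
import numpy as np

def softmax(z):
    # y_i = e^{z_i} / sum_j e^{z_j}; subtract max(z) for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

y = softmax(np.array([3.0, 1.0, -3.0]))
print(y.round(2))        # approximately [0.88 0.12 0.  ]
print(float(y.sum()))    # approximately 1.0 -- the outputs behave like probabilities
```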

18 How to set network parameters. θ = {W^1, b^1, W^2, b^2, ..., W^L, b^L}. Set the network parameters θ such that: given an image of "1" as input (the 16 x 16 = 256 ink/no-ink vector), the softmax output y_1 has the maximum value; given an image of "2", y_2 has the maximum value; and so on. How to let the neural network achieve this?

19 Training Data. Prepare training data: images and their labels. Use the training data to find the network parameters.

20 Cost. Given a set of network parameters θ, each example has a cost value L(θ): the discrepancy between the network output y and the target (e.g. the one-hot vector [1, 0, ..., 0] for the digit "1"). The cost can be the Euclidean distance or the cross entropy between the network output and the target.

21 Total Cost. For all R training examples x^1, x^2, ..., x^R, the network produces outputs y^1, y^2, ..., y^R, which are compared against the targets to give costs L_1(θ), L_2(θ), ..., L_R(θ). Total cost: C(θ) = Σ_{r=1}^{R} L_r(θ), which measures how bad the network parameters θ are on this task. Find the network parameters θ* that minimize this value.
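
A sketch of the cross-entropy cost summed over a training set, assuming softmax outputs and one-hot targets (an illustration, not the slides' own code; the small eps guards against log(0)):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # Cost of one example: -sum_i y_hat_i * log(y_i), with y the softmax output
    return -np.sum(y_hat * np.log(y + eps))

def total_cost(outputs, targets):
    # C(theta) = sum_r L_r(theta) over all R training examples
    return sum(cross_entropy(y, y_hat) for y, y_hat in zip(outputs, targets))

# Two toy examples: targets are one-hot vectors for the first two classes
outputs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
targets = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
print(total_cost(outputs, targets))   # small when the outputs match the targets
```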

22 Gradient Descent. Error surface: assume there are only two parameters w_1 and w_2 in the network, θ = {w_1, w_2}; the colors represent the value of C. Randomly pick a starting point θ^0, compute the gradient at θ^0, ∇C(θ^0) = [∂C(θ^0)/∂w_1, ∂C(θ^0)/∂w_2]^T, and move by the negative gradient times the learning rate η: θ^1 = θ^0 - η∇C(θ^0).

23 Gradient Descent. Repeat the same steps: compute the negative gradient at the current point, multiply by the learning rate η, and move, giving θ^1, θ^2, ... Eventually, we reach a minimum.
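
A minimal gradient-descent loop on a two-parameter toy cost (the quadratic cost below is only an illustration standing in for the network cost, not something from the slides):

```python
import numpy as np

def C(theta):
    # Toy error surface in (w1, w2); stands in for the network cost
    w1, w2 = theta
    return (w1 - 3.0) ** 2 + 10.0 * (w2 + 1.0) ** 2

def grad_C(theta):
    # Gradient [dC/dw1, dC/dw2] of the toy cost above
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 20.0 * (w2 + 1.0)])

eta = 0.05                                         # learning rate
theta = np.random.default_rng(0).normal(size=2)    # random starting point theta^0
for t in range(100):
    theta = theta - eta * grad_C(theta)            # theta^{t+1} = theta^t - eta * grad C(theta^t)
print(theta, C(theta))                             # close to the minimum at (3, -1)
```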

24 Local Minima. Gradient descent never guarantees the global minimum. Different initial points θ^0 reach different minima, so they give different results. ("Who is Afraid of Non-Convex Loss Functions?" _lecun_wia/)

25 Besides local minima. The cost over the parameter space also causes other problems: very slow progress on a plateau (∇C(θ) ≈ 0), getting stuck at a saddle point (∇C(θ) = 0), and getting stuck at a local minimum (∇C(θ) = 0).

26 In the physical world, momentum carries a ball across plateaus and out of small dips. How about putting this phenomenon into gradient descent?

27 Momentum. Movement = negative of gradient + momentum (the momentum term keeps part of the previous movement). Even where the gradient = 0, the accumulated momentum keeps the parameters moving. This still does not guarantee reaching the global minimum, but it gives some hope.
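
A sketch of adding a momentum term to the update, reusing the same toy cost idea as above (the momentum coefficient 0.9 is a common default, not a value from the slides):

```python
import numpy as np

def grad_C(theta):
    # Gradient of the toy cost (w1 - 3)^2 + 10 (w2 + 1)^2
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 20.0 * (w2 + 1.0)])

eta, lam = 0.02, 0.9                # learning rate and momentum coefficient
theta = np.zeros(2)
movement = np.zeros(2)              # accumulated momentum (previous movement)
for t in range(200):
    # Movement = negative of gradient + (scaled) previous movement
    movement = lam * movement - eta * grad_C(theta)
    theta = theta + movement
print(theta)                        # approaches the minimum at (3, -1)
```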

28 Mini-batch. The training data is divided into mini-batches. Randomly initialize θ^0. Pick the 1st mini-batch (e.g. the examples x^1 and x^31): C = L^1 + L^31 + ..., update θ^1 ← θ^0 - η∇C(θ^0). Pick the 2nd mini-batch (e.g. x^2 and x^16): C = L^2 + L^16 + ..., update θ^2 ← θ^1 - η∇C(θ^1). C is different each time we update the parameters!

29 Mini-batch. Compared with original gradient descent, the update trajectory with mini-batches is unstable, because each update only uses part of the data (the colors represent the total C on all training data).

30 Mini-batch. In practice, mini-batch training is not only faster but often better! Randomly initialize θ^0; pick the 1st mini-batch, C = C^1 + C^31 + ..., θ^1 ← θ^0 - η∇C(θ^0); pick the 2nd mini-batch, C = C^2 + C^16 + ..., θ^2 ← θ^1 - η∇C(θ^1); continue until all mini-batches have been picked (one epoch), then repeat the above process.
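
A sketch of the mini-batch loop. To keep it self-contained it uses a linear model with squared error, so the gradient can be written in closed form; the model, data, and hyperparameters are placeholders, not the DNN from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))               # 1000 training examples, 20 features
true_w = rng.normal(size=20)
Y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(20)                              # initial parameters theta^0
eta, batch_size = 0.1, 32
for epoch in range(10):                       # one epoch = all mini-batches picked once
    order = rng.permutation(len(X))           # shuffle, then split into mini-batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], Y[idx]
        # Gradient of the cost C computed only on this mini-batch
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(idx)
        w = w - eta * grad                    # theta^{t+1} = theta^t - eta * grad C
print(np.max(np.abs(w - true_w)))             # small after a few epochs
```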

31 Backpropagation A network can have millions of parameters. Backpropagation is the way to compute the gradients efficiently (not today) Ref: 5_2/Lecture/DNN%20backprop.ecm.mp4/index.html Many toolkits can compute the gradients automatically Ref: ture/theano%20dnn.ecm.mp4/index.html

32 Part II: Why Deep?

33 Deeper is Better? Results from Seide et al. compare word error rate (%) for networks of different depth and width on conversational speech transcription: with several stacked hidden layers of 2k neurons (1 X 2k, 2 X 2k, ...) the word error rate drops to 17.8%, while a single very wide hidden layer (1 X 16k) only reaches 22.1%. Not surprising on its own: more parameters, better performance. (Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech.)

34 Universality Theorem. Any continuous function f: R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons). Reference for the reason: eplearning.com/chap4.html. So why a deep neural network rather than a fat one?

35 Fat + Short v.s. Thin + Tall. Given the same number of parameters and the same inputs x_1, x_2, ..., x_N, which is better: a shallow network (fat + short) or a deep network (thin + tall)?

36 Fat + Short v.s. Thin + Tall. The Seide et al. comparison answers this: with a comparable number of parameters, deep networks of 2k-neuron layers (word error rate down to 17.8%) beat a single very wide layer such as 1 X 16k (22.1%). (Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech.)

37 Why Deep? Deep → modularization. Suppose we train four image classifiers directly: Classifier 1 for girls with long hair, Classifier 2 for boys with long hair, Classifier 3 for girls with short hair, Classifier 4 for boys with short hair. There are only a few examples of boys with long hair, so that classifier is weak.

38 Why Deep? Deep → modularization. Instead, first train basic classifiers for the attributes: boy or girl, and long or short hair. Each basic classifier can have sufficient training examples (all long-hair images v.s. all short-hair images, all girl images v.s. all boy images).

39 Why Deep? Deep → modularization. The basic classifiers are shared by the following classifiers as modules. Built on top of them, Classifier 1 (girls with long hair), Classifier 2 (boys with long hair), Classifier 3 (girls with short hair), and Classifier 4 (boys with short hair) all work fine, so even the class with little data (boys with long hair) can be trained by little data.

40 Why Deep? Deep → modularization. In a deep network, the first layer learns the most basic classifiers; the second layer uses the 1st layer as modules to build classifiers; later layers use the 2nd layer as modules, and so on. The modularization is automatically learned from data, so less training data may be enough. Deep learning also works on small data sets like TIMIT.

41 Hand-crafted kernel v.s. learnable kernel. An SVM applies a hand-crafted kernel function φ(x) and then a simple classifier. In deep learning, the hidden layers play the role of a learnable kernel φ(x), and the output layer is the simple classifier producing y_1, ..., y_M.

42 Hard to get the power of Deep. Before 2006, deeper usually did not imply better.

43 Part III: Tips for Training DNN

44 Recipe for Learning

45 Recipe for Learning. Don't forget to check the results on the training data: modify the network or use a better optimization strategy when the training results are bad, and prevent overfitting when the training results are good but the testing results are not.

46 Recipe for Learning. Modify the network: new activation functions, for example ReLU or Maxout. Better optimization strategy: adaptive learning rates. Prevent overfitting: dropout. Only use dropout when you have already obtained good results on the training data.

47 Part III: Tips for Training DNN New Activation Function

48 ReLU. Rectified Linear Unit (ReLU): σ(z) = a, where a = z for z > 0 and a = 0 for z ≤ 0. [Xavier Glorot, AISTATS 11] [Andrew L. Maas, ICML 13] [Kaiming He, arXiv 15]. Reasons: 1. Fast to compute. 2. Biological reason. 3. Equivalent to an infinite number of sigmoids with different biases. 4. Deals with the vanishing gradient problem.
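
A one-line sketch of ReLU and its derivative (setting the derivative at z = 0 to 0 is a convention assumed here, not stated on the slide):

```python
import numpy as np

def relu(z):
    # a = z for z > 0, a = 0 otherwise
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Derivative is 1 where the unit is active, 0 where it outputs 0
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```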

49 Vanishing Gradient Problem. With sigmoid activations, the layers near the input x_1, ..., x_N have smaller gradients and learn very slowly, so they stay almost random, while the layers near the output y_1, ..., y_M have larger gradients and learn very fast, so they converge, based on essentially random lower layers!? In 2006, people used RBM pre-training to get around this; in 2015, people use ReLU.

50 Vanishing Gradient Problem. Intuitive way to compute the gradient: perturb a weight w by Δw and see how the cost changes, ∂C/∂w ≈ ΔC/Δw. For a weight near the input, even a large Δw produces only a small change in the output (and hence in C), because each sigmoid squashes a large input change into a small output change; hence the smaller gradients near the input.

51 ReLU. With ReLU activations, each neuron operates in one of two linear regions: a = z (active) or a = 0 (inactive). The neurons whose output is 0 contribute nothing to the rest of the network.

52 ReLU. Removing the inactive neurons leaves a thinner, linear network from x_1, x_2 to y_1, y_2, which does not have the smaller-gradient problem (no sigmoid squashing along the active paths).

53 Maxout. ReLU is a special case of Maxout, a learnable activation function [Ian J. Goodfellow, ICML 13]. The linear pre-activations of a layer are divided into groups, and each maxout neuron outputs the max within its group (e.g. groups of 2). You can have more than 2 elements in a group.

54 Maxout. ReLU is a special case of Maxout, a learnable activation function [Ian J. Goodfellow, ICML 13]. The activation function in a maxout network can be any piecewise linear convex function; how many pieces it has depends on how many elements are in a group (2 elements in a group give 2 pieces, 3 elements give 3 pieces).
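
A sketch of a maxout layer: the linear outputs are split into groups and each group is reduced by a max. The group size and layer sizes below are illustrative assumptions:

```python
import numpy as np

def maxout_layer(x, W, b, group_size=2):
    # W: (num_units * group_size, input_dim), b: (num_units * group_size,)
    z = W @ x + b                       # all linear pieces
    z = z.reshape(-1, group_size)       # one row per maxout unit
    return z.max(axis=1)                # max over each group

rng = np.random.default_rng(0)
x = rng.normal(size=8)
W = rng.normal(size=(10, 8))            # 5 maxout units, 2 pieces each
b = np.zeros(10)
print(maxout_layer(x, W, b, group_size=2).shape)   # (5,)
```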

55 Part III: Tips for Training DNN Adaptive Learning Rate

56 Learning Rate. Set the learning rate η carefully. If the learning rate is too large, the step -η∇C(θ^0) overshoots and the cost may not decrease after each update.

57 Learning Rate. If the learning rate is too large, the cost may not decrease after each update; if it is too small, training is too slow. Can we give different parameters different learning rates?

58 Adagrad. Original gradient descent: θ^t ← θ^{t-1} - η∇C(θ^{t-1}). In Adagrad, each parameter w is considered separately: w^{t+1} ← w^t - η_w g^t, where g^t = ∂C(θ^t)/∂w and the parameter-dependent learning rate is η_w = η / sqrt(Σ_{i=0}^{t} (g^i)^2), i.e. the constant η divided by the root of the summation of the squares of the previous derivatives.

59 Adagrad. η_w = η / sqrt(Σ_{i=0}^{t} (g^i)^2). Example from the slide: a parameter with small derivatives (g^0 = 0.1, g^1 = 0.2) gets learning rates η/0.1 and then η/0.22, while a parameter with large derivatives (g^0 = 20, ...) gets learning rates η/20 and then η/22. Observations: 1. The learning rate is smaller and smaller for all parameters. 2. Smaller derivatives lead to a larger learning rate, and vice versa. Why?
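
A sketch of the Adagrad update for a vector of parameters; the small eps in the denominator avoids division by zero (common practice, not shown on the slides), and the toy gradients reuse the earlier two-parameter cost:

```python
import numpy as np

def adagrad_step(w, grad, sum_sq, eta=0.1, eps=1e-8):
    # Accumulate the squared derivatives seen so far, separately per parameter
    sum_sq = sum_sq + grad ** 2
    # Per-parameter learning rate: eta / sqrt(sum of squared past gradients)
    w = w - eta * grad / (np.sqrt(sum_sq) + eps)
    return w, sum_sq

w = np.zeros(2)
sum_sq = np.zeros(2)
for t in range(100):
    grad = np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])  # toy gradients
    w, sum_sq = adagrad_step(w, grad, sum_sq)
print(w)   # moves toward the minimum (3, -1); the effective step shrinks over time
```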

60 Why? Intuition: along a direction with larger derivatives (a steep valley), we want a smaller learning rate, and along a direction with smaller derivatives (a gentle slope), a larger learning rate, so the step taken in each direction is balanced.

61 Not the whole story Adagrad [John Duchi, JMLR 11] RMSprop Adadelta [Matthew D. Zeiler, arxiv 12] Adam [Diederik P. Kingma, ICLR 15] AdaSecant [Caglar Gulcehre, arxiv 14] No more pesky learning rates [Tom Schaul, arxiv 12]

62 Part III: Tips for Training DNN Dropout

63 Dropout. Training: pick a mini-batch and update θ^t ← θ^{t-1} - η∇C(θ^{t-1}). Each time, before computing the gradients, each neuron has p% chance to drop out.

64 Dropout. Training: when neurons drop out, the structure of the network is changed; the network becomes thinner. Use the new, thinner network for training on that mini-batch. For each mini-batch, we resample the dropout neurons.

65 Dropout. Testing: no dropout. If the dropout rate at training is p%, multiply all the weights by (1-p)%. For example, assume the dropout rate is 50%: if a weight is w = 1 after training, set w = 0.5 for testing.
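
A sketch of dropout at the level of one layer's activations: at training time a random mask drops each neuron with probability p, and at testing time the activations (equivalently, the weights) are scaled by (1-p). Purely illustrative, not the slides' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    # Each neuron is dropped (set to 0) with probability p; resampled per mini-batch
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test(a, p=0.5):
    # No dropout at test time; scale by (1 - p) instead
    return a * (1.0 - p)

a = np.ones(8)
print(dropout_train(a))   # e.g. [1. 0. 0. 1. ...] -- a thinner network
print(dropout_test(a))    # [0.5 0.5 ...] -- expected value matches training
```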

66 Dropout - Intuitive Reason. "My partner will slack off, so I have to do a good job." When people team up, if everyone expects the partner to do the work, nothing gets done in the end. However, if you know your partner may drop out, you will do better. At testing time no one actually drops out, so good results are obtained.

67 Dropout - Intuitive Reason. Why should the weights be multiplied by (1-p)% (the dropout rate) when testing? Assume the dropout rate is 50%. During training, on average half of the inputs through w_1, w_2, w_3, w_4 are dropped, giving a pre-activation of about z. At testing, with no dropout and the weights from training, the pre-activation would be about 2z; multiplying the weights by (1-p)% brings it back to z.

68 Dropout is a kind of ensemble. Ensemble: sample several subsets of the training set (Set 1, Set 2, Set 3, Set 4) and train a bunch of networks with different structures (Network 1, ..., Network 4), one per subset.

69 Dropout is a kind of ensemble. At testing, feed the testing data x into all the networks and average their outputs y_1, y_2, y_3, y_4.

70 Dropout is a kind of ensemble. Training with dropout: with M neurons there are 2^M possible thinned networks. Each mini-batch (mini-batch 1, 2, 3, 4, ...) trains one of these networks, and some parameters are shared between the networks.

71 Dropout is a kind of ensemble. Testing with dropout: ideally we would average the outputs y_1, y_2, y_3, ... of all the thinned networks on the testing data x. Multiplying all the weights by (1-p)% and using the full network approximates this average.

72 More about dropout. More references for dropout: [Nitish Srivastava, JMLR 14] [Pierre Baldi, NIPS 13] [Geoffrey E. Hinton, arXiv 12]. Dropout works better with Maxout [Ian J. Goodfellow, ICML 13]. Dropconnect [Li Wan, ICML 13]: dropout deletes neurons, while dropconnect deletes the connections between neurons. Annealed dropout [S.J. Rennie, SLT 14]: the dropout rate decreases over epochs. Standout [J. Ba, NIPS 13]: each neuron has a different dropout rate.

73 Part IV: Neural Network with Memory

74 Neural Network needs Memory. Named entity recognition: detecting named entities such as names of people, locations, organizations, etc. in a sentence. Given the word "apple", a DNN outputs one of: people, location, organization, none.

75 Neural Network needs Memory. In the sentence "the president of apple eats an apple" (x_1, ..., x_7), the first "apple" should be tagged ORG and the second NONE. A DNN applied to each word independently produces the same output y for both occurrences, so the DNN needs memory!

76 Recurrent Neural Network (RNN). The outputs of the hidden layer (a_1, a_2) are stored (copied) into the memory, and the memory is used as another input to the hidden layer at the next time step, together with x_1, x_2.

77 RNN. The same network, with input weights W_i, recurrent weights W_h, and output weights W_o, is used again and again: at each step t, the hidden state a^t is computed from the input x^t and the copied previous state a^{t-1}, and W_o maps a^t to the output y^t. So the output y^i depends on x^1, x^2, ..., x^i.
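
A minimal sketch of this unrolled computation, reusing the same weight matrices Wi, Wh, Wo at every time step (the sizes and the tanh hidden activation are assumptions for illustration):

```python
import numpy as np

def rnn_forward(xs, Wi, Wh, Wo):
    # xs: list of input vectors x^1, ..., x^T
    a = np.zeros(Wh.shape[0])                # memory, initialized to zeros
    ys = []
    for x in xs:
        a = np.tanh(Wi @ x + Wh @ a)         # hidden state, stored for the next step
        ys.append(Wo @ a)                    # y^t depends on x^1, ..., x^t
    return ys

rng = np.random.default_rng(0)
Wi = rng.normal(0, 0.1, (16, 8))     # input-to-hidden
Wh = rng.normal(0, 0.1, (16, 16))    # hidden-to-hidden (the "copy" back into memory)
Wo = rng.normal(0, 0.1, (4, 16))     # hidden-to-output
xs = [rng.normal(size=8) for _ in range(5)]
print(len(rnn_forward(xs, Wi, Wh, Wo)))      # 5 outputs, one per time step
```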

78 RNN: How to train? Each output y^1, y^2, y^3 is compared with its target to give costs L^1, L^2, L^3. Find the network parameters (W_i, W_h, W_o) that minimize the total cost, using backpropagation through time (BPTT).

79 Of course it can be deep: stack several recurrent hidden layers between the inputs x^t, x^{t+1}, x^{t+2} and the outputs y^t, y^{t+1}, y^{t+2}.

80 Bidirectional RNN. One RNN reads the inputs x^t, x^{t+1}, x^{t+2} forward and another reads them backward; the output y^t is produced from the hidden states of both, so it can depend on the whole input sequence.

81 Many to Many (Output is shorter). Both input and output are sequences, but the output is shorter. E.g. speech recognition: the input is a vector sequence (acoustic frames) and the output is a character sequence. If the network outputs one character per frame, e.g. "好好好棒棒棒棒棒", trimming the repeated characters gives "好棒" ("great"). Problem: why can't the output be "好棒棒" (where the repeated character is intended)?

82 Many to Many (Output is shorter). Connectionist Temporal Classification (CTC) [Alex Graves, ICML 06][Alex Graves, ICML 14][Haşim Sak, Interspeech 15][Jie Li, Interspeech 15][Andrew Senior, ASRU 15]: add an extra symbol φ representing null. The frame-level sequence "好 φ φ 棒 φ φ φ φ" decodes to "好棒", while "好 φ φ 棒 φ 棒 φ φ" decodes to "好棒棒", so the two outputs can be distinguished.
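
A sketch of the CTC decoding rule described above: merge repeated symbols, then remove the null symbol φ. This is only greedy collapsing of an already-chosen frame-level sequence; the full CTC training criterion is more involved:

```python
def ctc_collapse(frames, null="φ"):
    # Merge consecutive repeats, then drop the null symbol
    out = []
    prev = None
    for s in frames:
        if s != prev:
            out.append(s)
        prev = s
    return "".join(s for s in out if s != null)

print(ctc_collapse(list("好φφ棒φφφφ")))   # 好棒
print(ctc_collapse(list("好φφ棒φ棒φφ")))  # 好棒棒
```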

83 Many to Many (No Limitation). Both input and output are sequences, with possibly different lengths: sequence-to-sequence learning. E.g. machine translation ("machine learning" → "機器學習"). An RNN reads the input sequence "machine learning"; its final hidden state contains all the information about the input sequence.

84 Many to Many (No Limitation). The decoder then generates the output characters one by one: 機, 器, 學, 習, ... but it may keep generating (e.g. ... 慣, 性, ...) because it doesn't know when to stop.

85 Many to Many (No Limitation). (A joke from the PTT forum: the push comment "推 tlkagk: ========= 斷 ==========" uses "斷" ("break") to stop a chain of comments.) Ref: E6%8E%A8%E6%96%87 (鄉民百科)

86 Many to Many (No Limitation). Add a special symbol "===" ("斷", break): the decoder generates 機, 器, 學, 習, === and stops as soon as it produces this symbol. [Ilya Sutskever, NIPS 14][Dzmitry Bahdanau, arXiv 15]

87 (Thanks to 曾柏翔 for providing the experimental results.) Unfortunately, RNN-based networks are not always easy to learn. In real experiments on language modeling, the cost sometimes decreases nicely only when we are lucky.

88 The error surface is rough. The error surface of an RNN is either very flat or very steep; when an update steps onto a steep cliff, the gradient explodes, so clipping is applied to the gradient before the update. [Razvan Pascanu, ICML 13]
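
A sketch of norm-based gradient clipping before the update, as used when the gradient explodes on a cliff (the threshold value is an arbitrary illustration):

```python
import numpy as np

def clip_by_norm(grad, threshold=5.0):
    # If the gradient norm exceeds the threshold, rescale it to the threshold
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, -40.0])          # an "exploding" gradient with norm 50
print(clip_by_norm(g))               # [ 3. -4.] -- same direction, norm 5
```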

89 Why? Toy example: a linear RNN with a single recurrent weight w run for 1000 time steps, so y^1000 = w^999 (with input 1 at the first step). w = 1 gives y^1000 = 1, but w = 1.01 gives a huge y^1000: a large gradient, so should we use a small learning rate? w = 0.99 or w = 0.01 gives y^1000 ≈ 0: a small gradient, so should we use a large learning rate? The same parameter thus needs very different learning rates in different regions.

90 Helpful Techniques. Nesterov's Accelerated Gradient (NAG): an advanced momentum method. RMSProp: an advanced approach that gives each parameter a different learning rate, taking into account how the derivatives change (second-derivative-like information). Long Short-term Memory (LSTM): can deal with gradient vanishing (but not gradient explosion).

91 Long Short-term Memory (LSTM). An LSTM is a special neuron with 4 inputs and 1 output. It contains a memory cell and three gates: an input gate, a forget gate, and an output gate. The signals that control the input gate, the forget gate, and the output gate all come from other parts of the network.

92 LSTM. The four inputs are z (the cell input), z_i (input gate control), z_f (forget gate control), and z_o (output gate control). The gate activation function f is usually a sigmoid, with values between 0 and 1, mimicking an open or closed gate. The memory cell is updated by multiplying with the gate values, c' = g(z) f(z_i) + c f(z_f), and the output is a = h(c') f(z_o).
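
A sketch of one LSTM step following the slide's formulas c' = g(z) f(z_i) + c f(z_f) and a = h(c') f(z_o). How z, z_i, z_f, z_o are produced from the rest of the network is left abstract; sigmoid gates and tanh for g and h are a common choice assumed here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(c, z, z_i, z_f, z_o):
    # c: current memory cell value; z, z_i, z_f, z_o: the 4 inputs of the LSTM neuron
    c_new = np.tanh(z) * sigmoid(z_i) + c * sigmoid(z_f)   # c' = g(z) f(z_i) + c f(z_f)
    a = np.tanh(c_new) * sigmoid(z_o)                      # a  = h(c') f(z_o)
    return c_new, a

c = 0.0
for t, (z, z_i, z_f, z_o) in enumerate([(1.0, 2.0, 2.0, 2.0),
                                        (0.5, -2.0, 2.0, 2.0)]):
    c, a = lstm_step(c, z, z_i, z_f, z_o)
    print(t, round(c, 3), round(a, 3))   # the cell accumulates when the input gate is open
```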

93 Original Network: each neuron takes the input x_1, x_2, computes z_1 or z_2, and outputs a_1 or a_2. To build an LSTM network, simply replace the neurons with LSTM units.

94 Replacing each neuron with an LSTM unit requires 4 times as many parameters, because each LSTM unit has four inputs (z, z_i, z_f, z_o), each computed from x_1, x_2 with its own weights.

95 LSTM. In a real LSTM network, the unit at time t computes z, z_i, z_f, z_o from the input x^t, the previous output h^{t-1}, and (with the peephole extension) the previous cell value c^{t-1}; it produces c^t, h^t, and y^t, and the same computation is repeated at time t+1.

96 Other Simpler Alternatives. Gated Recurrent Unit (GRU) [Cho, EMNLP 14]; Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR 15]; vanilla RNN initialized with the identity matrix plus ReLU activation function [Quoc V. Le, arXiv 15], which outperforms or is comparable with LSTM on 4 different tasks.

97 What is the next wave? Attention-based Model: a DNN/LSTM maps the input x to the output y, while a reading head controller and a writing head controller move a reading head and a writing head over an internal memory (or information from the output). Already applied to speech recognition, caption generation, QA, and visual QA.

98 What is the next wave? Attention-based Model. End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. arXiv pre-print. Neural Turing Machines. Alex Graves, Greg Wayne, Ivo Danihelka. arXiv pre-print, 2014. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. Kumar et al. arXiv pre-print, 2015. Neural Machine Translation by Jointly Learning to Align and Translate. D. Bahdanau, K. Cho, Y. Bengio. International Conference on Learning Representations. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu et al. arXiv pre-print. Attention-Based Models for Speech Recognition. Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio. arXiv pre-print. Recurrent Models of Visual Attention. V. Mnih, N. Hees, A. Graves and K. Kavukcuoglu. NIPS. A Neural Attention Model for Abstractive Sentence Summarization. A. M. Rush, S. Chopra and J. Weston. EMNLP 2015.

99 Concluding Remarks

100 Concluding Remarks. Introduction of deep learning; some reasons for using deep learning; new techniques for deep learning (ReLU and Maxout, giving the parameters different learning rates, dropout); networks with memory (recurrent neural networks, long short-term memory (LSTM)).

101 Reading Materials. "Neural Networks and Deep Learning", written by Michael Nielsen. "Deep Learning" (not finished yet), written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville.

102 Thank you for your attention!

103 Acknowledgement. Thanks to Ryan Sun for writing in to point out typos in the slides.

104 Appendix

105 Matrix Operation. σ(Wx + b) = a: the worked numerical example on the slide evaluates this for a specific weight matrix W, input x, and bias b.

106 Why Deep? Logic circuits analogy: two levels of basic logic gates can represent any Boolean function, yet no one uses two levels of logic gates to build computers. Using multiple layers of logic gates to build some functions is much simpler (fewer gates needed).

107 Boosting analogy. Boosting: apply weak classifiers to the input x and combine them into a boosted classifier. Deep learning: the first layer applies weak classifiers to x_1, x_2, ..., x_N, and each following layer boosts the classifiers of the previous layer again.

108 Maxout. ReLU is a special case of Maxout. A ReLU neuron computes z = wx + b from the input x and outputs a = max(z, 0). A maxout neuron with a group of two computes z_1 = wx + b and z_2 = 0·x + 0 = 0 and outputs a = max(z_1, z_2), which is exactly the ReLU.

109 Maxout. More generally, a maxout neuron computes z_1 = wx + b and z_2 = w'x + b' with learnable parameters w', b', and outputs a = max(z_1, z_2): a learnable activation function.
