Deep Learning for Natural Language Processing


1 Deep Learning for Natural Language Processing Xipeng Qiu Fudan University 2016/5/29, CCF ADL, Beijing Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 1 / 131

2 Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 2 / 131

3 Basic Concepts Artificial Intelligence Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 3 / 131

4 Basic Concepts Artificial Intelligence Begin with AI Human: Memory, Computation Computer: Learning, Thinking, Creativity Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 4 / 131

5 Basic Concepts Artificial Intelligence Turing Test Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 5 / 131

6 Basic Concepts Artificial Intelligence Artificial Intelligence Definition from Wikipedia Artificial intelligence (AI) is the intelligence exhibited by machines. Colloquially, the term artificial intelligence is likely to be applied when a machine uses cutting-edge techniques to competently perform or mimic cognitive functions that we intuitively associate with human minds, such as learning and problem solving. Research Topics Knowledge Representation Machine Learning Natural Language Processing Computer Vision Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 6 / 131

7 Basic Concepts Artificial Intelligence Challenge: Semantic Gap Text in Human 床前明月光, 疑是地上霜 举头望明月, 低头思故乡 (Li Bai's "Quiet Night Thought": moonlight before my bed, like frost on the ground; I raise my head to the bright moon, then lower it, thinking of home) Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 7 / 131

8 Basic Concepts Artificial Intelligence Challenge: Semantic Gap Text in Computer Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 8 / 131

9 Basic Concepts Artificial Intelligence Challenge: Semantic Gap Figure: Guernica (Picasso) Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 9 / 131

10 Basic Concepts Machine Learning Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 10 / 131

11 Basic Concepts Machine Learning Machine Learning Input: x Model Output: y Training Data: (x, y) Learning Algorithm Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 11 / 131

12 Basic Concepts Machine Learning Basic Concepts of Machine Learning
Input Data: (x_i, y_i), 1 ≤ i ≤ m.
Model: Linear Model: y = f(x) = w^T x + b; Generalized Linear Model: y = f(x) = w^T φ(x) + b; Non-linear Model: Neural Network.
Criterion: Loss Function L(y, f(x)).
Optimization: Empirical Risk Minimization, Q(θ) = (1/m) ∑_{i=1}^{m} L(y_i, f(x_i, θ)); Regularization: ‖θ‖^2; Objective Function: Q(θ) + λ‖θ‖^2.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 12 / 131

13 Basic Concepts Machine Learning Loss Function
Given a test sample (x, y), the predicted label is f(x, θ).
0-1 Loss: L(y, f(x, θ)) = 0 if y = f(x, θ), and 1 if y ≠ f(x, θ); that is, L(y, f(x, θ)) = I(y ≠ f(x, θ)), where I is the indicator function. (1)(2)
Quadratic Loss: L(y, ŷ) = (y − f(x, θ))^2 (3)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 13 / 131

14 Basic Concepts Machine Learning Loss Function
Cross-entropy Loss: We regard f_i(x, θ) as the conditional probability of class i, with f_i(x, θ) ∈ [0, 1] and ∑_{i=1}^{C} f_i(x, θ) = 1. (4)
f_y(x, θ) is the likelihood of y. The negative log-likelihood is L(y, f(x, θ)) = − log f_y(x, θ). (5)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 14 / 131

15 Basic Concepts Machine Learning Loss Function
We use a one-hot vector y to represent class c, in which y_c = 1 and all other elements are 0. The negative log-likelihood can then be rewritten as L(y, f(x, θ)) = − ∑_{i=1}^{C} y_i log f_i(x, θ), (6)
where y_i is the distribution over gold labels. Thus, Eq 6 is the cross-entropy loss function.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 15 / 131

16 Basic Concepts Machine Learning Loss Function
Hinge Loss: For binary classification, y and f(x, θ) are in {−1, +1}. The hinge loss is L(y, f(x, θ)) = max(0, 1 − y f(x, θ)) = |1 − y f(x, θ)|_+. (7)(8)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 16 / 131
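As a concrete illustration (not from the slides), here is a minimal numpy sketch of the four losses above for a single example; the toy values and label conventions (class index for 0-1 and cross-entropy, {−1, +1} for hinge) are my own assumptions.

```python
import numpy as np

def zero_one_loss(y, y_pred):
    # L(y, f(x)) = I(y != f(x))
    return float(y != y_pred)

def quadratic_loss(y, y_hat):
    # L(y, y_hat) = (y - y_hat)^2
    return (y - y_hat) ** 2

def cross_entropy_loss(y, probs):
    # y is the gold class index; probs sums to 1 over the C classes
    return -np.log(probs[y])

def hinge_loss(y, score):
    # y in {-1, +1}, score = f(x, theta)
    return max(0.0, 1.0 - y * score)

print(zero_one_loss(1, 0))                                # 1.0
print(quadratic_loss(2.0, 1.5))                           # 0.25
print(cross_entropy_loss(2, np.array([0.1, 0.2, 0.7])))   # ~0.357
print(hinge_loss(+1, 0.3))                                # 0.7
```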

17 Basic Concepts Machine Learning Loss Function For binary classification, y and f(x, θ) are in {−1, +1}; the losses above can be plotted as functions of z = y f(x, θ). (Figure: loss curves; source link truncated to yandongl/loss.html in the transcription.) Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 17 / 131

18 Basic Concepts Machine Learning Parameter Learning
In ML, our objective is to learn the parameter θ that minimizes the loss function:
θ* = arg min_θ R(θ) = arg min_θ (1/N) ∑_{i=1}^{N} L(y^(i), f(x^(i), θ)). (9)(10)
Gradient Descent: θ_{t+1} = θ_t − λ ∂R(θ_t)/∂θ = θ_t − λ ∑_{i=1}^{N} ∂R(θ_t; x^(i), y^(i))/∂θ, (11)(12)
where λ is also called the learning rate in ML.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 18 / 131

19 Basic Concepts Machine Learning Stochastic Gradient Descent (SGD) θ_{t+1} = θ_t − λ ∂R(θ_t; x^(t), y^(t))/∂θ. (13) Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 19 / 131
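To make the update rule concrete, here is a minimal SGD sketch on a toy least-squares problem (my own illustration; the data, learning rate and epoch count are assumptions, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr = 0.01                                # learning rate (lambda in the slides)
for epoch in range(5):
    order = rng.permutation(len(X))      # shuffle each epoch
    for i in order:
        grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (w.x - y)^2 for one sample
        w -= lr * grad                   # Eq 13: one step per training example
print(w)                                 # close to true_w
```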

20 Basic Concepts Machine Learning Two Tricks of SGD Early-Stop Shuffle Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 20 / 131

21 Basic Concepts Machine Learning Linear Classification
For binary classification y ∈ {0, 1}, the classifier is ŷ = 1 if w^T x > 0 and 0 if w^T x ≤ 0; that is, ŷ = I(w^T x > 0). (14)
Figure: Binary Linear Classification (the decision boundary w^T x = 0 in the (x_1, x_2) plane, with weight vector w)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 21 / 131

22 Basic Concepts Machine Learning Logistic Regression
How to learn the parameter w: Perceptron, Logistic Regression, etc.
The posterior probability of y = 1 is P(y = 1 | x) = σ(w^T x) = 1 / (1 + exp(−w^T x)), (15)
where σ(·) is the logistic function. The posterior probability of y = 0 is P(y = 0 | x) = 1 − P(y = 1 | x).
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 22 / 131

23 Basic Concepts Machine Learning Logistic Regression
Given N samples (x^(i), y^(i)), 1 ≤ i ≤ N, we use the cross-entropy loss function
J(w) = − ∑_{i=1}^{N} ( y^(i) log σ(w^T x^(i)) + (1 − y^(i)) log(1 − σ(w^T x^(i))) ). (16)
The gradient of J(w) is ∂J(w)/∂w = ∑_{i=1}^{N} x^(i) ( σ(w^T x^(i)) − y^(i) ). (17)
Initialize w_0 = 0, and update w_{t+1} = w_t − λ ∂J(w)/∂w. (18)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 23 / 131
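A small sketch of this training loop in numpy, following Eqs 16-18 on toy data (the data, learning rate and step count are my own assumptions for illustration).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X @ np.array([2.0, -1.0, 0.5]) > 0).astype(float)   # toy binary labels

w, lr = np.zeros(3), 0.1
for step in range(200):
    grad = X.T @ (sigmoid(X @ w) - y)    # Eq 17: sum_i x_i (sigma(w.x_i) - y_i)
    w -= lr * grad / len(X)              # gradient descent on the cross-entropy loss
print(w)
```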

24 Basic Concepts Machine Learning Multiclass Classification
Generally, y ∈ {1, …, C}. We define C discriminant functions f_c(x) = w_c^T x, c = 1, …, C, (19) where w_c is the weight vector of class c. Thus, ŷ = arg max_{c=1,…,C} w_c^T x. (20)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 24 / 131

25 Basic Concepts Machine Learning Softmax Regression
Softmax regression is a generalization of logistic regression to multi-class classification problems. With softmax, the posterior probability of y = c is P(y = c | x) = softmax(w_c^T x) = exp(w_c^T x) / ∑_{i=1}^{C} exp(w_i^T x). (21)
Class c is represented by the one-hot vector y = [I(1 = c), I(2 = c), …, I(C = c)]^T, (22) where I(·) is the indicator function.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 25 / 131

26 Basic Concepts Machine Learning Softmax Regression
Rewriting Eq 21 in vector form, ŷ = softmax(W^T x) = exp(W^T x) / (1^T exp(W^T x)) = exp(z) / (1^T exp(z)), (23)
where W = [w_1, …, w_C], ŷ is the vector of predicted posterior probabilities, and z = W^T x is the input of the softmax function.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 26 / 131

27 Basic Concepts Machine Learning Softmax Regression
Given the training set (x^(i), y^(i)), 1 ≤ i ≤ N, the cross-entropy loss is J(W) = − ∑_{i=1}^{N} ∑_{c=1}^{C} y_c^(i) log ŷ_c^(i) = − ∑_{i=1}^{N} (y^(i))^T log ŷ^(i). (24)
The gradient of J(W) is ∂J(W)/∂w_c = − ∑_{i=1}^{N} x^(i) ( y^(i) − ŷ^(i) )_c.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 27 / 131
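A sketch of one loss evaluation and one gradient step of softmax regression (Eqs 23-24), assuming toy dimensions and a single training example; names and shapes are illustrative only.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d, C = 4, 3
rng = np.random.default_rng(0)
W = np.zeros((d, C))
x = rng.normal(size=d)
y = np.array([0.0, 1.0, 0.0])        # one-hot gold label

y_hat = softmax(W.T @ x)             # Eq 23: predicted posterior
loss = -np.dot(y, np.log(y_hat))     # Eq 24: cross-entropy for one example
grad_W = -np.outer(x, y - y_hat)     # dJ/dW = -x (y - y_hat)^T
W -= 0.1 * grad_W
print(loss, y_hat)
```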

28 Basic Concepts Machine Learning The ideal pipeline of NLP Word Segmentation POS Tagging Applications: Question Answering Machine Translation Sentiment Analysis Syntactic Parsing Semantic Parsing Knowledge Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 28 / 131

29 Basic Concepts Machine Learning But in practice: End-to-End I like this movie. I dislike this movie. Model + Model Selection Decoding/Inference Feature Extraction Parameter Learning Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 29 / 131

30 Basic Concepts Machine Learning Feature Extraction Bag-of-Word Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 30 / 131

31 Basic Concepts Machine Learning Text Classication Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 31 / 131

32 Basic Concepts Deep Learning Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 32 / 131

33 Basic Concepts Deep Learning Artificial Neural Network Artificial neural networks 1 (ANNs) are a family of models inspired by biological neural networks (the central nervous systems of animals, in particular the brain). Artificial neural networks are generally presented as systems of interconnected neurons which exchange messages between each other. 1 Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 33 / 131

34 Basic Concepts Deep Learning Artificial Neuron
Input: x = (x_1, x_2, …, x_n); State: z; Output: a.
z = w^T x + b (25)
a = f(z) (26)
Figure: a single neuron with inputs x_1 … x_n, weights w_1 … w_n, bias b, and activation σ producing output 0/1.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 34 / 131

35 Basic Concepts Deep Learning Activation Function
Sigmoid functions:
σ(x) = 1 / (1 + e^{−x}) (27)
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}) (28)
tanh(x) = 2σ(2x) − 1
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 35 / 131

36 Basic Concepts Deep Learning Activation Function
Rectifier function 2, also called rectified linear unit (ReLU) 3: rectifier(x) = max(0, x) (29)
Softplus function 4: softplus(x) = log(1 + e^x) (30)
2 X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In: International Conference on Artificial Intelligence and Statistics. 2011, pp. 3 V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010, pp. 4 C. Dugas et al. Incorporating second-order functional knowledge for better option pricing. In: Advances in Neural Information Processing Systems (2001).
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 36 / 131

37 Basic Concepts Deep Learning Activation Function Figure: activation functions: (a) logistic, (b) tanh, (c) rectifier, (d) softplus. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 37 / 131

38 Basic Concepts Deep Learning Types of Artificial Neural Network 5 Feedforward neural network, also called Multilayer Perceptron (MLP). Recurrent neural network. 5 Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 38 / 131

39 Basic Concepts Deep Learning Basic Concepts of Deep Learning Model: Articial neural networks that consist of multiple hidden non-linear layers. Function: Non-linear function y = σ( i w ix i + b). Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 39 / 131

40 Basic Concepts Deep Learning Feedforward Neural Network In a feedforward neural network, the information moves in only one direction, forward: from the input nodes, data goes through the hidden nodes (if any) to the output nodes. There are no cycles or loops in the network. Figure: a network with input layer (x_1 … x_4), two hidden layers, and output y. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 40 / 131

41 Basic Concepts Deep Learning Feedforward Computing
Definitions: L: number of layers; n_l: number of neurons in the l-th layer; f_l(·): activation function of the l-th layer; W^(l) ∈ R^{n_l × n_{l−1}}: weight matrix between the (l−1)-th and l-th layers; b^(l) ∈ R^{n_l}: bias vector between the (l−1)-th and l-th layers; z^(l) ∈ R^{n_l}: state vector of the neurons in the l-th layer; a^(l) ∈ R^{n_l}: activation vector of the neurons in the l-th layer.
z^(l) = W^(l) a^(l−1) + b^(l) (31)
a^(l) = f_l(z^(l)) (32)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 41 / 131

42 Basic Concepts Deep Learning Feedforward Computing Thus, z^(l) = W^(l) f_{l−1}(z^(l−1)) + b^(l), (33) and the information propagates as x = a^(0) → z^(1) → a^(1) → z^(2) → … → a^(L−1) → z^(L) → a^(L) = y. (34) Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 42 / 131
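A minimal sketch of this forward pass (Eqs 31-32) in numpy; the layer sizes, random weights and activation choices are my own assumptions for illustration.

```python
import numpy as np

def feedforward(x, weights, biases, activations):
    """Forward pass: z(l) = W(l) a(l-1) + b(l), a(l) = f_l(z(l))."""
    a = x                                  # a(0) = x
    for W, b, f in zip(weights, biases, activations):
        z = W @ a + b
        a = f(z)
    return a                               # a(L) = y

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 1]                       # toy 2-hidden-layer network
Ws = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
fs = [np.tanh, np.tanh, lambda z: z]       # linear output layer
print(feedforward(rng.normal(size=4), Ws, bs, fs))
```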

43 Basic Concepts Deep Learning Combining Feedforward Network and Machine Learning
Given training samples (x^(i), y^(i)), 1 ≤ i ≤ N, and a feedforward network f(x; W, b), the objective function is
J(W, b) = ∑_{i=1}^{N} L(y^(i), f(x^(i); W, b)) + λ‖W‖_F^2 = ∑_{i=1}^{N} J(W, b; x^(i), y^(i)) + λ‖W‖_F^2, (35)(36)
where ‖W‖_F^2 = ∑_{l=1}^{L} ∑_{i=1}^{n_{l+1}} ∑_{j=1}^{n_l} (W_ij^(l))^2.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 43 / 131

44 Basic Concepts Deep Learning Learning by GD
W^(l) = W^(l) − α ∂J(W, b)/∂W^(l) = W^(l) − α ( ∑_{i=1}^{N} ∂J(W, b; x^(i), y^(i))/∂W^(l) + λW^(l) ), (37)(38)
b^(l) = b^(l) − α ∂J(W, b)/∂b^(l) = b^(l) − α ∑_{i=1}^{N} ∂J(W, b; x^(i), y^(i))/∂b^(l). (39)(40)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 44 / 131

45 Basic Concepts Deep Learning Backpropagation
How to compute ∂J(W, b; x, y)/∂W_ij^(l) and ∂J(W, b; x, y)/∂W^(l)?
∂J(W, b; x, y)/∂W_ij^(l) = ( ∂J(W, b; x, y)/∂z^(l) )^T ( ∂z^(l)/∂W_ij^(l) ). (42)
We define δ^(l) = ∂J(W, b; x, y)/∂z^(l) ∈ R^{n_l}.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 45 / 131

46 Basic Concepts Deep Learning Backpropagation
Because z^(l) = W^(l) a^(l−1) + b^(l), ∂z^(l)/∂W_ij^(l) = ∂(W^(l) a^(l−1) + b^(l))/∂W_ij^(l) = [0, …, a_j^(l−1), …, 0]^T, with a_j^(l−1) in the i-th row and zeros elsewhere. (43)
Therefore, ∂J(W, b; x, y)/∂W_ij^(l) = δ_i^(l) a_j^(l−1). (44)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 46 / 131

47 Basic Concepts Deep Learning Backpropagation
In matrix form, ∂J(W, b; x, y)/∂W^(l) = δ^(l) (a^(l−1))^T. (46)
In the same way, ∂J(W, b; x, y)/∂b^(l) = δ^(l). (47)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 47 / 131

48 Basic Concepts Deep Learning How to compute δ^(l)?
δ^(l) = ∂J(W, b; x, y)/∂z^(l) (48)
= (∂a^(l)/∂z^(l)) (∂z^(l+1)/∂a^(l)) (∂J(W, b; x, y)/∂z^(l+1)) (49)
= diag(f_l′(z^(l))) (W^(l+1))^T δ^(l+1) (50)
= f_l′(z^(l)) ⊙ ( (W^(l+1))^T δ^(l+1) ). (51)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 48 / 131

49 Basic Concepts Deep Learning Backpropagation Algorithm
Input: training set (x^(i), y^(i)), i = 1, …, N; number of iterations T. Output: W, b.
1 Initialize W, b;
2 for t = 1 … T do
3   for i = 1 … N do
4     (1) Feedforward computing;
5     (2) Compute δ^(l) by Eq 51;
6     (3) Compute the gradients of the parameters by Eqs 46-47: ∂J(W, b; x^(i), y^(i))/∂W^(l) = δ^(l) (a^(l−1))^T; ∂J(W, b; x^(i), y^(i))/∂b^(l) = δ^(l);
7   end
8   (4) Update parameters: W^(l) = W^(l) − α ( ∑_{i=1}^{N} ∂J(W, b; x^(i), y^(i))/∂W^(l) + λW^(l) ); b^(l) = b^(l) − α ∑_{i=1}^{N} ∂J(W, b; x^(i), y^(i))/∂b^(l);
9 end
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 49 / 131
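A compact numpy sketch of steps (1)-(3) for a small sigmoid MLP with a squared-error loss (the loss choice, shapes and data are my own assumptions; the slides leave the output loss unspecified).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, Ws, bs):
    """Gradients for J = 0.5*||a(L) - y||^2, following Eqs 44-51."""
    a, As = x, [x]
    for W, b in zip(Ws, bs):                       # (1) feedforward computing
        a = sigmoid(W @ a + b)
        As.append(a)
    delta = (As[-1] - y) * As[-1] * (1 - As[-1])   # delta(L) for this loss
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        grads_W[l] = np.outer(delta, As[l])        # Eq 46: delta(l) a(l-1)^T
        grads_b[l] = delta                         # Eq 47
        if l > 0:                                  # Eq 51, using sigma'(z) = a(1-a)
            delta = As[l] * (1 - As[l]) * (Ws[l].T @ delta)
    return grads_W, grads_b

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
bs = [np.zeros(3), np.zeros(2)]
gW, gb = backprop(rng.normal(size=4), np.array([0.0, 1.0]), Ws, bs)
print([g.shape for g in gW])                       # [(3, 4), (2, 3)]
```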

50 Basic Concepts Deep Learning Gradient Vanishing
δ^(l) = f_l′(z^(l)) ⊙ ( (W^(l+1))^T δ^(l+1) ). (52)
When we use a sigmoid-type function, such as the logistic σ(x) or tanh,
σ′(x) = σ(x)(1 − σ(x)) ∈ [0, 0.25], (53)
tanh′(x) = 1 − (tanh(x))^2 ∈ [0, 1]. (54)
Figure: gradients of (e) logistic and (f) tanh.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 50 / 131

51 Basic Concepts Deep Learning Difficulties: Huge Parameters, Non-convex Optimization, Gradient Vanishing, Poor Interpretability. Requirements: High Computation, Big Data, Good Algorithms. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 51 / 131

52 Basic Concepts Deep Learning Tricks and Skills 6 7 Use ReLU non-linearities Use cross-entropy loss for classification SGD + mini-batch Shuffle the training samples (very important) Early-Stop Normalize the input variables (zero mean, unit variance) Schedule to decrease the learning rate Use a bit of L1 or L2 regularization on the weights (or a combination) Use dropout for regularization Data Augmentation 6 G. B. Orr and K.-R. Müller. Neural networks: tricks of the trade. Springer, 7 Geoff Hinton, Yoshua Bengio & Yann LeCun, Deep Learning, NIPS 2015 Tutorial. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 52 / 131

53 Neural Models for Representation Learning General Architecture Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 53 / 131

54 Neural Models for Representation Learning General Architecture General Neural Architectures for NLP How to use neural network for the NLP tasks? Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 54 / 131

55 Neural Models for Representation Learning General Architecture General Neural Architectures for NLP How to use neural network for the NLP tasks? Distributed Representation Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 54 / 131

56 Neural Models for Representation Learning General Architecture General Neural Architectures for NLP 8 1 represent the words/features with dense vectors (embeddings) by lookup table; 2 concatenate the vectors; 3 multi-layer neural networks. classification / matching / ranking 8 R. Collobert et al. Natural language processing (almost) from scratch. In: The Journal of Machine Learning Research (2011). Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 55 / 131

57 Neural Models for Representation Learning General Architecture Difference from the traditional methods
Features: Traditional methods use discrete, high-dimensional vectors (one-hot representation); neural methods use dense, low-dimensional vectors (distributed representation).
Classifier: Traditional methods use a linear classifier; neural methods use a non-linear one.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 56 / 131

58 Neural Models for Representation Learning General Architecture The key point is how to encode the word, phrase, sentence, paragraph, or even document into the distributed representation? Representation Learning Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 57 / 131

59 Neural Models for Representation Learning General Architecture Representation Learning for NLP Word Level NNLM C&W CBOW & Skip-Gram Sentence Level NBOW Sequence Models: Recurrent NN (LSTM/GRU), Paragraph Vector Topological Models: Recursive NN, Convolutional Models: Convolutional NN Document Level NBOW Hierarchical Models two-level CNN Sequence Models LSTM, Paragraph Vector Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 58 / 131

60 Neural Models for Representation Learning General Architecture Let's start with Language Models
A statistical language model is a probability distribution over sequences of words. A sequence W consists of L words.
P(W) = P(w_{1:L}) = P(w_1, …, w_L) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) … P(w_L | w_{1:(L−1)}) = ∏_{i=1}^{L} P(w_i | w_{1:(i−1)}). (55)
n-gram model: P(W) = ∏_{i=1}^{L} P(w_i | w_{(i−n+1):(i−1)}). (56)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 59 / 131

61 Neural Models for Representation Learning General Architecture Neural Probabilistic Language Model 9 turn unsupervised learning into supervised learning; avoid the data sparsity of n-gram model; project each word into a low dimensional space. 9 Y. Bengio, R. Ducharme, and P. Vincent. A Neural probabilistic language model. In: Journal of Machine Learning Research (2003). Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 60 / 131

62 Neural Models for Representation Learning General Architecture Problem of Very Large Vocabulary
Softmax output: P_θ^h(w) = exp(s_θ(w, h)) / ∑_{w′} exp(s_θ(w′, h)). (57)
Unfortunately, both evaluating P_θ^h and computing the corresponding likelihood gradient require normalizing over the entire vocabulary.
Hierarchical Softmax: a tree-structured vocabulary 10; Negative Sampling 11; noise-contrastive estimation (NCE) 12.
10 A. Mnih and G. Hinton. A scalable hierarchical distributed language model. In: Advances in Neural Information Processing Systems (2009); F. Morin and Y. Bengio. Hierarchical Probabilistic Neural Network Language Model. In: AISTATS. Vol. 5. Citeseer. 2005, pp. 11 T. Mikolov et al. Efficient estimation of word representations in vector space. In: arXiv preprint arXiv: (2013). 12 A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In: Advances in Neural Information Processing Systems. 2013, pp.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 61 / 131

63 Neural Models for Representation Learning General Architecture Linguistic Regularities of Word Embeddings T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic Regularities in Continuous Space Word Representations. In: HLT-NAACL. 2013, pp Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 62 / 131

64 Neural Models for Representation Learning General Architecture Skip-Gram Model Mikolov et al., Efficient estimation of word representations in vector space. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 63 / 131

65 Neural Models for Representation Learning General Architecture Skip-Gram Model
Given a pair of words (w, c), the probability that the word c is observed in the context of the target word w is given by Pr(D = 1 | w, c) = σ(w^T c) = 1 / (1 + exp(−w^T c)), where w and c are the embedding vectors of w and c respectively.
The probability of not observing word c in the context of w is given by Pr(D = 0 | w, c) = 1 − Pr(D = 1 | w, c) = 1 / (1 + exp(w^T c)).
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 64 / 131

66 Neural Models for Representation Learning General Architecture Skip-Gram Model with Negative Sampling
Given a training set D, the word embeddings are learned by maximizing the following objective function:
J(θ) = ∑_{(w,c)∈D} log Pr(D = 1 | w, c) + ∑_{(w,c)∈D̃} log Pr(D = 0 | w, c),
where the set D̃ contains randomly sampled negative examples, assumed to be all incorrect.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 65 / 131
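A minimal sketch of one gradient-ascent step on this objective for a single (word, context) pair with k sampled negatives; the dimensions, learning rate and random initialization are my own toy assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, k, lr = 8, 5, 0.05
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=d)             # target word embedding
c_pos = rng.normal(scale=0.1, size=d)         # observed context embedding
C_neg = rng.normal(scale=0.1, size=(k, d))    # sampled negative context embeddings

# Per-pair objective: log sigma(w.c) + sum_neg log sigma(-w.c_neg)
grad_w = (1 - sigmoid(w @ c_pos)) * c_pos - (sigmoid(C_neg @ w)[:, None] * C_neg).sum(0)
grad_c_pos = (1 - sigmoid(w @ c_pos)) * w
grad_C_neg = -sigmoid(C_neg @ w)[:, None] * w

w += lr * grad_w                              # gradient ascent (we maximize J)
c_pos += lr * grad_c_pos
C_neg += lr * grad_C_neg
```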

67 Neural Models for Representation Learning Convolutional Neural Network Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 66 / 131

68 Neural Models for Representation Learning Convolutional Neural Network Convolutional Neural Network Convolutional neural network (CNN, or ConvNet) is a type of feed-forward artificial neural network. Distinguishing features 15 : 1 Local connectivity 2 Shared weights 3 Subsampling 15 Y. LeCun et al. Gradient-based learning applied to document recognition. In: Proceedings of the IEEE 11 (1998). Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 67 / 131

69 Neural Models for Representation Learning Convolutional Neural Network Convolution
Convolution is an integral that expresses the amount of overlap of one function as it is shifted over another function. Given an input sequence x_t, t = 1, …, n, and a filter f_t, t = 1, …, m, the convolution is
y_t = ∑_{k=1}^{m} f_k x_{t−k+1}. (58)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 68 / 131
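A small sketch of Eq 58 in numpy, computed for the valid positions and checked against numpy's own convolution; the example sequence and filter are my own toy values.

```python
import numpy as np

def conv1d(x, f):
    """y_t = sum_{k=1..m} f_k x_{t-k+1}, written 0-indexed as sum_k f[k] x[t-k]."""
    n, m = len(x), len(f)
    return np.array([sum(f[k] * x[t - k] for k in range(m))
                     for t in range(m - 1, n)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
f = np.array([0.5, 0.5])                   # simple averaging filter
print(conv1d(x, f))                        # [1.5, 2.5, 3.5, 4.5]
print(np.convolve(x, f, mode="valid"))     # matches numpy's implementation
```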

70 Neural Models for Representation Learning Convolutional Neural Network One-dimensional convolution 15 Figure from: Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 69 / 131

71 Neural Models for Representation Learning Convolutional Neural Network Two-dimensional convolution
Given an image x_ij, 1 ≤ i ≤ M, 1 ≤ j ≤ N, and a filter f_ij, 1 ≤ i ≤ m, 1 ≤ j ≤ n, the convolution is
y_ij = ∑_{u=1}^{m} ∑_{v=1}^{n} f_uv x_{i−u+1, j−v+1}. (59)
Mean filter: f_uv = 1/(mn).
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 70 / 131

72 Neural Models for Representation Learning Convolutional Neural Network Convolutional Layer
a^(l) = f(w^(l) ⊗ a^(l−1) + b^(l)), (60)
where ⊗ is the convolution operation. w^(l) is shared by all the neurons of the l-th layer. Only m + 1 parameters are needed, and n^(l+1) = n^(l) − m + 1.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 71 / 131

73 Neural Models for Representation Learning Convolutional Neural Network Fully Connected Layer V.S. Convolutional Layer (a) Fully Connected Layer (b) Convolutional Layer Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 72 / 131

74 Neural Models for Representation Learning Convolutional Neural Network Pooling Layer
It is common to periodically insert a pooling layer between successive convolutional layers, to progressively reduce the spatial size of the representation, reduce the number of parameters and the amount of computation in the network, and avoid overfitting.
For a feature map X^(l), we divide it into several (non-)overlapping regions R_k, k = 1, …, K. With a pooling function down(·),
X_k^(l+1) = f(Z_k^(l+1)) = f( w^(l+1) down(R_k) + b^(l+1) ). (61)(62)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 73 / 131

75 Neural Models for Representation Learning Convolutional Neural Network Pooling Layer
X^(l+1) = f(Z^(l+1)) = f( w^(l+1) down(X^(l)) + b^(l+1) ). (63)(64)
Two choices of down(·): maximum pooling and average pooling.
pool_max(R_k) = max_{i∈R_k} a_i (65)
pool_avg(R_k) = (1/|R_k|) ∑_{i∈R_k} a_i. (66)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 74 / 131

76 Neural Models for Representation Learning Convolutional Neural Network Pooling Layer 15 Figure from: Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 75 / 131

77 Neural Models for Representation Learning Convolutional Neural Network Large Scale Visual Recognition Challenge Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 76 / 131

78 Neural Models for Representation Learning Convolutional Neural Network DeepMind's AlphaGo 15 Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 77 / 131

79 Neural Models for Representation Learning Convolutional Neural Network DeepMind's AlphaGo 15 Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 78 / 131

80 Neural Models for Representation Learning Convolutional Neural Network CNN for Sentence Modeling Input: a sentence of length n. After the lookup layer, X = [x_1, x_2, …, x_n] ∈ R^{d×n}. variable-length input; convolution; pooling. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 79 / 131

81 Neural Models for Representation Learning Convolutional Neural Network CNN for Sentence Modeling 16
Key steps: convolution z_{t:t+m−1} = x_t ⊕ x_{t+1} ⊕ … ⊕ x_{t+m−1} ∈ R^{dm}; matrix-vector operation x_t^l = f(W^l z_{t:t+m−1} + b^l); pooling (max over time) x_i^l = max_t x_{i,t}^{l−1}.
16 Collobert et al., Natural language processing (almost) from scratch.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 80 / 131

82 Neural Models for Representation Learning Convolutional Neural Network CNN for Sentence Modeling 17
Key steps: convolution z_{t:t+m−1} = x_t ⊕ x_{t+1} ⊕ … ⊕ x_{t+m−1} ∈ R^{dm}; vector-vector operation x_t^l = f(w^l · z_{t:t+m−1} + b^l); multiple filters / multiple channels; pooling (max over time) x_i^l = max_t x_{i,t}^{l−1}.
17 Y. Kim. Convolutional neural networks for sentence classification. In: arXiv preprint arXiv: (2014).
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 81 / 131
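To make the window concatenation plus max-over-time pooling concrete, here is a numpy sketch in the spirit of this architecture; the single filter width, tanh non-linearity and toy shapes are my own assumptions, not the paper's exact configuration.

```python
import numpy as np

def sentence_cnn(X, filters, b):
    """Convolution over word windows + max-over-time pooling.

    X: d x n matrix of word embeddings; filters: K x (d*m) weight matrix."""
    d, n = X.shape
    K, dm = filters.shape
    m = dm // d
    # Each window z_{t:t+m-1} is the concatenation of m consecutive word vectors.
    windows = np.stack([X[:, t:t + m].reshape(-1, order="F") for t in range(n - m + 1)])
    feature_maps = np.tanh(windows @ filters.T + b)   # (n-m+1) x K
    return feature_maps.max(axis=0)                   # max over time -> K-dim sentence vector

rng = np.random.default_rng(0)
d, n, m, K = 6, 10, 3, 4
X = rng.normal(size=(d, n))
print(sentence_cnn(X, rng.normal(scale=0.1, size=(K, d * m)), np.zeros(K)))
```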

83 Neural Models for Representation Learning Convolutional Neural Network Dynamic CNN for Sentence Modeling 18
Key steps: one-dimensional convolution x_{i,t}^l = f(w_i^l x_{i,t:t+m−1} + b_i^l); k-max pooling (max over time) with k_l = max(k_top, ⌈((L − l)/L) n⌉); (optional) folding: sums every two rows in a feature map component-wise; multiple filters / multiple channels.
18 N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A Convolutional Neural Network for Modelling Sentences. In: Proceedings of ACL.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 82 / 131

84 Neural Models for Representation Learning Convolutional Neural Network CNN for Sentence Modeling 19
Key steps: convolution z_{t:t+m−1} = x_t ⊕ x_{t+1} ⊕ … ⊕ x_{t+m−1} ∈ R^{dm}; matrix-vector operation x_t^l = f(W^l z_{t:t+m−1}^{l−1} + b^l); binary max pooling (over time) x_{i,t}^l = max(x_{i,2t−1}^{l−1}, x_{i,2t}^{l−1}).
19 B. Hu et al. Convolutional neural network architectures for matching natural language sentences. In: Advances in Neural Information Processing Systems.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 83 / 131

85 Neural Models for Representation Learning Recurrent Neural Network Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 84 / 131

86 Neural Models for Representation Learning Recurrent Neural Network Recurrent Neural Network (RNN)
The RNN has recurrent hidden states whose output at each time is dependent on that of the previous time. More formally, given a sequence x_{1:n} = (x_1, …, x_t, …, x_n), the RNN updates its recurrent hidden state h_t by
h_t = 0 if t = 0, and h_t = f(h_{t−1}, x_t) otherwise. (67)
Figure: an RNN cell with input x_t, hidden state h_t, and a delayed feedback connection from h_{t−1}.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 85 / 131

87 Neural Models for Representation Learning Recurrent Neural Network Simple Recurrent Network 20
h_t = f(U h_{t−1} + W x_t + b), (68)
where f is a non-linear function.
Figure: the unrolled network with inputs x_1 … x_T, hidden states h_1 … h_T, and outputs y_1 … y_T.
20 J. L. Elman. Finding structure in time. In: Cognitive Science 2 (1990).
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 86 / 131
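A minimal sketch of Eq 68 unrolled over a sequence; the tanh non-linearity, zero initial state and toy dimensions are my own assumptions for illustration.

```python
import numpy as np

def srn_forward(X, U, W, b):
    """Simple recurrent network: h_t = f(U h_{t-1} + W x_t + b), h_0 = 0."""
    h = np.zeros(U.shape[0])
    hs = []
    for x_t in X:                         # X is a sequence of input vectors
        h = np.tanh(U @ h + W @ x_t + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 3, 6
X = rng.normal(size=(T, d_in))
H = srn_forward(X, rng.normal(scale=0.1, size=(d_h, d_h)),
                rng.normal(scale=0.1, size=(d_h, d_in)), np.zeros(d_h))
print(H.shape)                            # (6, 3): one hidden state per time step
```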

88 Neural Models for Representation Learning Recurrent Neural Network Backpropagation Through Time (BPTT)
The loss at time t is J_t, and the whole loss is J = ∑_{t=1}^{T} J_t. The gradient of J is
∂J/∂U = ∑_{t=1}^{T} ∂J_t/∂U = ∑_{t=1}^{T} (∂h_t/∂U) (∂J_t/∂h_t). (69)(70)
Figure: the unrolled network with losses J_1 … J_T, hidden states h_1 … h_T, and inputs x_1 … x_T.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 87 / 131

89 Neural Models for Representation Learning Recurrent Neural Network Gradient of RNN
∂J/∂U = ∑_{t=1}^{T} ∑_{k=1}^{t} (∂h_k/∂U) (∂h_t/∂h_k) (∂y_t/∂h_t) (∂J_t/∂y_t), (71)
where ∂h_t/∂h_k = ∏_{i=k+1}^{t} ∂h_i/∂h_{i−1} = ∏_{i=k+1}^{t} U^T diag(f′(h_{i−1})). (72)(73)
Thus ∂J/∂U = ∑_{t=1}^{T} ∑_{k=1}^{t} (∂h_k/∂U) ( ∏_{i=k+1}^{t} U^T diag(f′(h_{i−1})) ) (∂y_t/∂h_t) (∂J_t/∂y_t). (74)
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 88 / 131

90 Neural Models for Representation Learning Recurrent Neural Network Long-Term Dependencies
Define γ ≅ ‖U^T diag(f′(h_{i−1}))‖.
Exploding gradient problem: when γ > 1 and t − k → ∞, γ^{t−k} → ∞.
Vanishing gradient problem: when γ < 1 and t − k → ∞, γ^{t−k} → 0.
There are various ways to solve the long-term dependency problem.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 89 / 131

91 Neural Models for Representation Learning Recurrent Neural Network Long Short-Term Memory Neural Network (LSTM) 21
The core of the LSTM model is a memory cell c encoding, at every time step, memory of what inputs have been observed up to this step. The behavior of the cell is controlled by three gates: an input gate i, an output gate o, and a forget gate f.
i_t = σ(W_i x_t + U_i h_{t−1} + V_i c_{t−1}), (75)
f_t = σ(W_f x_t + U_f h_{t−1} + V_f c_{t−1}), (76)
o_t = σ(W_o x_t + U_o h_{t−1} + V_o c_t), (77)
c̃_t = tanh(W_c x_t + U_c h_{t−1}), (78)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t, (79)
h_t = o_t ⊙ tanh(c_t). (80)
21 S. Hochreiter and J. Schmidhuber. Long short-term memory. In: Neural Computation 8 (1997).
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 90 / 131
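A sketch of one LSTM step in numpy following Eqs 75-80, except that the peephole terms (the V matrices) are omitted for brevity; the parameter names and toy sizes are my own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step (peephole connections V omitted for brevity)."""
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev)        # input gate
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev)        # forget gate
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev)        # output gate
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev)  # candidate cell (Eq 78)
    c = f * c_prev + i * c_tilde                         # Eq 79
    h = o * np.tanh(c)                                   # Eq 80
    return h, c

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
P = {k: rng.normal(scale=0.1, size=(d_h, d_in if k.startswith("W") else d_h))
     for k in ["Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wc", "Uc"]}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), P)
print(h, c)
```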

92 Neural Models for Representation Learning Recurrent Neural Network LSTM Architecture Figure: the LSTM cell, with gates f_t, i_t, o_t, candidate c̃_t, cell state c_{t−1} → c_t, and hidden output h_t. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 91 / 131

93 Neural Models for Representation Learning Recurrent Neural Network Gated Recurrent Unit (GRU) 22
Two gates: an update gate z and a reset gate r.
r_t = σ(W_r x_t + U_r h_{t−1}), (81)
z_t = σ(W_z x_t + U_z h_{t−1}), (82)
h̃_t = tanh(W_c x_t + U(r_t ⊙ h_{t−1})), (83)
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t. (84)
22 K. Cho et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: arXiv preprint arXiv: (2014); J. Chung et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. In: arXiv preprint arXiv: (2014).
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 92 / 131

94 Neural Models for Representation Learning Recurrent Neural Network GRU Architecture Figure: the GRU cell, with reset gate r_t, update gate z_t, candidate h̃_t, and hidden state h_{t−1} → h_t. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 93 / 131

95 Neural Models for Representation Learning Recurrent Neural Network Stacked RNN Figure: three stacked recurrent layers h^(1), h^(2), h^(3) between inputs x_1 … x_T and outputs y_1 … y_T. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 94 / 131

96 Neural Models for Representation Learning Recurrent Neural Network Bidirectional RNN
h_t^(1) = f(U^(1) h_{t−1}^(1) + W^(1) x_t + b^(1)), (86)
h_t^(2) = f(U^(2) h_{t+1}^(2) + W^(2) x_t + b^(2)), (87)
h_t = h_t^(1) ⊕ h_t^(2). (88)
Figure: a forward layer h^(1) and a backward layer h^(2) over inputs x_1 … x_T producing outputs y_1 … y_T.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 95 / 131

97 Neural Models for Representation Learning Recurrent Neural Network Application of RNN: Sequence to Category. Text Classification, Sentiment Classification. Figure: two ways to form the sequence representation from h_1 … h_T: (c) the mean of the hidden states, (d) the last hidden state. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 96 / 131

98 Neural Models for Representation Learning Recurrent Neural Network Application of RNN: Synchronous Sequence to Sequence Sequence Labeling, such as Chinese word segmentation, POS tagging y 1 y 2 y T h 1 h 2 h T x 1 x 2 x T Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 97 / 131

99 Neural Models for Representation Learning Recurrent Neural Network Application of RNN: Asynchronous Sequence to Sequence Machine Translation y 1 y 2 y M h 1 h 2 h T h T +1 h T +2 h T +M x 1 x 2 x T < EOS > Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 98 / 131

100 Neural Models for Representation Learning Recurrent Neural Network Unfolded LSTM for Text Classication h 1 h 2 h 3 h 4 h T softmax x 1 x 2 x 3 x 4 x T y Drawback: long-term dependencies need to be transmitted one-by-one along the sequence. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 99 / 131

101 Neural Models for Representation Learning Recurrent Neural Network Unfolded Multi-Timescale LSTM 23 g 3 1 g 3 2 g 3 3 g 3 4 g 3 T g 2 1 g 2 2 g 2 3 g 2 4 g 2 T softmax g 1 1 g 1 2 g 1 3 g 1 4 g 1 T y x 1 x 2 x 3 x 4 x T 23 P. Liu et al. Multi-Timescale Long Short-Term Memory Neural Network for Modelling Sentences and Documents. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 100 / 131

102 Neural Models for Representation Learning Recurrent Neural Network LSTM for Sentiment Analysis Figure: LSTM vs. MT-LSTM visualizations on example sentences such as "Is this progress?", "He'd create a movie better than this.", and "It's not exactly a gourmet meal but the fare is fair, even coming from the drive." Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 101 / 131

103 Neural Models for Representation Learning Recurrent Neural Network Paragraph Vector Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In: arxiv preprint arxiv: (2014). Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 102 / 131

104 Neural Models for Representation Learning Recurrent Neural Network Memory Mechanism
What are the differences among the various models, from a memory view?
Model: Short-term / Long short-term / Global / External
SRN: Yes / No / No / No
LSTM/GRU: Yes / Yes / Maybe / No
PV: Yes / Yes / Yes / No
NTM/DMN: Yes / Yes / Maybe / Yes
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 103 / 131

105 Neural Models for Representation Learning Recursive Neural Network Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 104 / 131

106 Neural Models for Representation Learning Recursive Neural Network Recursive Neural Network (RecNN) 25
Topological models compose the sentence representation following a given topological structure over the words, e.g. the parse of "a red bike": (a red bike, NP) → (a, DET) (red bike, NP); (red bike, NP) → (red, JJ) (bike, NN).
Given a labeled binary parse tree, ((p_2 → a p_1), (p_1 → b c)), the node representations are computed by p_1 = f(W [b; c]) and p_2 = f(W [a; p_1]).
25 R. Socher et al. Parsing with compositional vector grammars. In: Proceedings of ACL.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 105 / 131

107 Neural Models for Representation Learning Recursive Neural Network Recursive Convolutional Neural Network 26 Recursive neural network can only process the binary combination and is not suitable for dependency parsing. a red bike,nn Pooling Convolution a bike,nn red bike,nn a,det red,jj bike,nn introducing the convolution and pooling layers; modeling the complicated interactions of the head word and its children. 26 C. Zhu et al. A Re-Ranking Model For Dependency Parser With Recursive Convolutional Neural Network. In: Proceedings of Annual Meeting of the Association for Computational Linguistics Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 106 / 131

108 Neural Models for Representation Learning Recursive Neural Network Tree-Structured LSTMs 27 Natural language exhibits syntactic properties that would naturally combine words to phrases. Child-Sum Tree-LSTMs N-ary Tree-LSTMs 27 K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In: arxiv preprint arxiv: (2015). Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 107 / 131

109 Neural Models for Representation Learning Recursive Neural Network Gated Recursive Neural Network 28
Figure: DAG-based combination over the characters of 下雨天地面积水 (rainy day / ground / accumulated water), with segmentation tags B, M, E, S.
DAG-based Recursive Neural Network; gating mechanism; a relatively complicated solution. GRNN models the complicated combinations of the features, selecting and preserving the useful combinations via reset and update gates.
28 X. Chen et al. Gated Recursive Neural Network For Chinese Word Segmentation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015; X. Chen et al. Sentence Modeling with Gated Recursive Neural Network. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 108 / 131

110 Neural Models for Representation Learning Attention Model Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 109 / 131

111 Neural Models for Representation Learning Attention Model Attention Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 110 / 131

112 Neural Models for Representation Learning Attention Model Attention Model 29
1 The context vector c_i is computed as a weighted sum of the h_j: c_i = ∑_{j=1}^{T} α_ij h_j.
2 The weight α_ij is computed by α_ij = softmax( v^T tanh(W s_{i−1} + U h_j) ).
29 D. Bahdanau, K. Cho, and Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In: ArXiv e-prints (Sept. 2014). arXiv: [cs.CL].
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 111 / 131
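A small numpy sketch of this additive attention step: scoring every encoder state against the previous decoder state, normalizing with softmax, and forming the context vector. The dimensions and random parameters are my own toy assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_prev, H, W, U, v):
    """Additive attention: alpha_j = softmax(v.tanh(W s_prev + U h_j)), c = sum_j alpha_j h_j."""
    scores = np.array([v @ np.tanh(W @ s_prev + U @ h_j) for h_j in H])
    alpha = softmax(scores)
    return alpha @ H, alpha               # context vector and attention weights

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 4, 3, 6
H = rng.normal(size=(T, d_h))             # encoder hidden states h_1 ... h_T
c, alpha = attention_context(rng.normal(size=d_s), H,
                             rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)),
                             rng.normal(size=d_a))
print(alpha.round(3), c.shape)
```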

113 Applications Question Answering Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 112 / 131

114 Applications Question Answering Question Answering Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 113 / 131

115 Applications Question Answering LSTM K. M. Hermann et al. Teaching machines to read and comprehend. In: Advances in Neural Information Processing Systems. 2015, pp Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 114 / 131

116 Applications Question Answering Dynamic Memory Networks A. Kumar et al. Ask me anything: Dynamic memory networks for natural language processing. In: arxiv preprint arxiv: (2015). Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 115 / 131

117 Applications Question Answering Memory Networks S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In: Advances in Neural Information Processing Systems. 2015, pp Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 116 / 131

118 Applications Question Answering Neural Reasoner B. Peng et al. Towards Neural Network-based Reasoning. In: arxiv preprint arxiv: (2015). Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 117 / 131

119 Applications Machine Translation Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 118 / 131

120 Applications Machine Translation Sequence to Sequence Machine Translation 34 Neural machine translation is a recently proposed framework for machine translation based purely on neural networks. 34 I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems. 2014, pp Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 119 / 131

121 Applications Machine Translation Image Caption A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp O. Vinyals et al. Show and tell: A neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp K. Xu et al. Show, attend and tell: Neural image caption generation with visual attention. In: arxiv preprint arxiv: (2015). Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 120 / 131

122 Applications Machine Translation Image Caption Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 121 / 131

123 Applications Text Matching Table of Contents 1 Basic Concepts Artificial Intelligence Machine Learning Deep Learning 2 Neural Models for Representation Learning General Architecture Convolutional Neural Network Recurrent Neural Network Recursive Neural Network Attention Model 3 Applications Question Answering Machine Translation Text Matching 4 Challenges & Open Problems Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 122 / 131

124 Applications Text Matching Text Matching Among many natural language processing (NLP) tasks, such as text classification, question answering and machine translation, a common problem is modelling the relevance/similarity of a pair of texts, which is also called text semantic matching. Three types of interaction models: Weak Interaction Models, Semi-interaction Models, Strong Interaction Models. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 123 / 131

125 Applications Text Matching Weak Interaction Models
Some early works focus on sentence-level interactions, such as ARC-I 38, CNTN 39 and so on. These models first encode two sequences into continuous dense vectors by separate neural models, and then compute the matching score based on the sentence encodings.
Figure: Convolutional Neural Tensor Network (question and answer encoders feeding a matching score).
38 Hu et al., Convolutional neural network architectures for matching natural language sentences. 39 X. Qiu and X. Huang. Convolutional Neural Tensor Network Architecture for Community-based Question Answering. In: Proceedings of the International Joint Conference on Artificial Intelligence.
Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 124 / 131

126 Applications Text Matching Semi-interaction Models Another kind of models use a soft attention mechanism to obtain the representation of one sentence conditioned on the representation of another sentence, such as ABCNN 40 and Attention LSTM 41. These models can alleviate the weak-interaction problem to some extent. Figure: Attention LSTM 42. 40 W. Yin et al. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. In: arXiv preprint arXiv: (2015). 41 Hermann et al., Teaching machines to read and comprehend. 42 T. Rocktäschel et al. Reasoning about Entailment with Neural Attention. In: arXiv preprint arXiv: (2015). Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 125 / 131

127 Applications Text Matching Strong Interaction Models These models build the interaction at different granularities (word, phrase and sentence level), such as ARC-II 43, MV-LSTM 44, and coupled-LSTMs 45. The final matching score depends on these different levels of interaction. Figure: coupled-LSTMs 43 Hu et al., Convolutional neural network architectures for matching natural language sentences. 44 S. Wan et al. A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations. In: AAAI. 45 P. Liu, X. Qiu, and X. Huang. Modelling Interaction of Sentence Pair with coupled-LSTMs. In: arXiv preprint arXiv: (2016). Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 126 / 131

128 Challenges & Open Problems Conclusion of DL4NLP (just kidding) Long long ago: you must know intrinsic rules of data. Past ten years: you just know effective features of data. Nowadays: you just need to have big data. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 127 / 131

129 Challenges & Open Problems Challenges & Open Problems Depth of network Rare words Fundamental data structure Long-term dependencies Natural language understanding & reasoning Biology inspired model Unsupervised learning Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 128 / 131

130 Challenges & Open Problems DL4NLP from scratch Select a real problem Translate your problem to (supervised) learning problem Prepare your data and hardware (GPU) Select a library: Theano, Keras, TensorFlow Find an open source implementation Incrementally writing your own code Run it. Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 129 / 131

131 Challenges & Open Problems More Information 神经网络与深度学习 (Neural Networks and Deep Learning), latest lecture notes: Xipeng Qiu (Fudan University) Deep Learning for Natural Language Processing 130 / 131

a) b) (Natural Language Processing; NLP) (Deep Learning) Bag of words White House RGB [1] IBM

a) b) (Natural Language Processing; NLP) (Deep Learning) Bag of words White House RGB [1] IBM c 1. (Natural Language Processing; NLP) (Deep Learning) RGB IBM 135 8511 5 6 52 yutat@jp.ibm.com a) b) 2. 1 0 2 1 Bag of words White House 2 [1] 2015 4 Copyright c by ORSJ. Unauthorized reproduction of

More information

Deep Learning and Lexical, Syntactic and Semantic Analysis. Wanxiang Che and Yue Zhang

Deep Learning and Lexical, Syntactic and Semantic Analysis. Wanxiang Che and Yue Zhang Deep Learning and Lexical, Syntactic and Semantic Analysis Wanxiang Che and Yue Zhang 2016-10 Part 2: Introduction to Deep Learning Part 2.1: Deep Learning Background What is Machine Learning? From Data

More information

Lecture 17: Neural Networks and Deep Learning

Lecture 17: Neural Networks and Deep Learning UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Natural Language Processing and Recurrent Neural Networks

Natural Language Processing and Recurrent Neural Networks Natural Language Processing and Recurrent Neural Networks Pranay Tarafdar October 19 th, 2018 Outline Introduction to NLP Word2vec RNN GRU LSTM Demo What is NLP? Natural Language? : Huge amount of information

More information

Introduction to Deep Neural Networks

Introduction to Deep Neural Networks Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic

More information

Lecture 11 Recurrent Neural Networks I

Lecture 11 Recurrent Neural Networks I Lecture 11 Recurrent Neural Networks I CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 01, 2017 Introduction Sequence Learning with Neural Networks Some Sequence Tasks

More information

Deep Learning Recurrent Networks 2/28/2018

Deep Learning Recurrent Networks 2/28/2018 Deep Learning Recurrent Networks /8/8 Recap: Recurrent networks can be incredibly effective Story so far Y(t+) Stock vector X(t) X(t+) X(t+) X(t+) X(t+) X(t+5) X(t+) X(t+7) Iterated structures are good

More information

Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287

Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the

More information

Lecture 11 Recurrent Neural Networks I
