Deep Learning: A Statistical Perspective


1 Introduction Deep Learning: A Statistical Perspective. Myunghee Cho Paik. Guest lectures by Gisoo Kim, Yongchan Kwon, Young-geun Kim, Wonyoung Kim and Youngwon Choi. Seoul National University, March-June, 2018.

2 Introduction

3 Introduction Natural Language Processing. Natural Language Processing (NLP) includes: sentiment analysis, machine translation, text generation, ... How do we train models on language? How can we convert language into numbers?

4 Introduction Word Embedding. How can we map words into R^d? One-hot encoding: each vector has nothing to do with the other vectors; for one-hot vectors u ≠ v, u^T u = v^T v = 1 and u^T v = 0. However, a word is characterized by the company it keeps: "ice" is closer to "solid" than to "gas".
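A minimal sketch (not from the slides) of why one-hot encoding carries no similarity information: every pair of distinct one-hot vectors is orthogonal. The toy vocabulary is an assumption.

```python
import numpy as np

vocab = ["ice", "solid", "gas"]        # hypothetical toy vocabulary
one_hot = np.eye(len(vocab))           # row i is the one-hot vector of vocab[i]

ice, solid = one_hot[0], one_hot[1]
print(ice @ solid)                     # 0.0: "ice" looks no closer to "solid" than to "gas"
print(ice @ ice)                       # 1.0: each vector only matches itself
```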

5 Introduction Main Questions in Word Embedding. Vocabulary set: V = {a, the, deep, statistics, ...}. Size-N corpus: C = (v^(1), v^(2), ..., v^(N)), v^(1), v^(2), ..., v^(N) ∈ V. Given the corpus data, how can we measure similarity between words, e.g. sim(deep, statistics)? How can we define f and learn w_deep, w_statistics such that sim(deep, statistics) = f(w_deep, w_statistics)?

6 Introduction Some Famous Word Embedding Techniques: Latent Semantic Analysis (LSA) (Deerwester et al., 1990); Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996); Word2Vec (Mikolov et al., 2013a); GloVe (Pennington et al., 2014).

7 Introduction LSA (Deerwester et al., 1990). Term-document matrix: X_{t×d}. sim(a, b) ∝ co-occurrence within documents. Singular Value Decomposition: X_{t×d} = T S D^T. Keeping the k largest singular values: X̂_{t×d} = T_{t×k} S_{k×k} (D_{d×k})^T, where T_{t×k} gives k-dim term vectors and D_{d×k} gives k-dim document vectors. Figure: from (Deerwester et al., 1990).
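A minimal LSA sketch, assuming a small hypothetical term-document count matrix X (t × d); truncating the SVD to the k largest singular values gives k-dimensional term vectors whose cosine similarity reflects shared documents.

```python
import numpy as np

X = np.array([[2., 0., 1.],   # term "deep"
              [1., 0., 2.],   # term "learning"
              [0., 3., 0.]])  # term "baseball"
k = 2

T, s, Dt = np.linalg.svd(X, full_matrices=False)
term_vecs = T[:, :k] * s[:k]      # k-dim term representations (T_{t x k} S_{k x k})
doc_vecs = Dt[:k, :].T            # k-dim document representations

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(term_vecs[0], term_vecs[1]))  # "deep" vs "learning": high (shared documents)
print(cos(term_vecs[0], term_vecs[2]))  # "deep" vs "baseball": low
```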

8 Introduction HAL (Lund and Burgess, 1996). Term-context-term matrix X_{V×V}: how many times does the column word appear in front of the row term? sim(a, b) ∝ co-occurrence in nearby context. Concatenate the row and column of each term to make a 2V-dim vector, then reduce dimension with the k principal components. Trained on 160M terms, with V = 70,000. Figure: from (Lund and Burgess, 1996).

9 Introduction Sim(·,·) ∝ Co-occurrence? Co-occurrence with "and" or "the" does not mean semantic similarity: does a pair merely appear frequently, or is the similarity significant? This motivates transforming the counts or defining a new measure of similarity: entropy/correlation-based normalization (Rohde et al., 2006); positive pointwise mutual information (PPMI), max{0, log [p(context | term) / p(context)]} (Bullinaria and Levy, 2007); square-root-type transformation (Lebret and Collobert, 2014); modeling p(context | term) within every local window (Word2Vec).
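A minimal PPMI sketch, assuming a hypothetical term-context count matrix X (V × V); it applies PPMI(term, context) = max{0, log p(context | term) / p(context)} entrywise.

```python
import numpy as np

X = np.array([[0., 4., 1.],
              [4., 0., 2.],
              [1., 2., 0.]])                              # toy co-occurrence counts

p_context_given_term = X / X.sum(axis=1, keepdims=True)   # row-normalize: p(context | term)
p_context = X.sum(axis=0) / X.sum()                       # marginal p(context)

with np.errstate(divide="ignore"):
    pmi = np.log(p_context_given_term / p_context)
ppmi = np.maximum(pmi, 0.0)                               # clip negatives (and -inf) to 0
print(ppmi.round(2))
```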

10 Word2Vec

11 Word2Vec Model Setup (Mikolov et al., 2013). Vocabulary set: V = {e_1, e_2, ..., e_V} ⊂ {0,1}^V (one-hot vectors). Size-N corpus: C = (v^(1), v^(2), ..., v^(N)), v^(1), ..., v^(N) ∈ V. Embedded word vectors: W_{d×V} = [w_1 w_2 ... w_V] (input representation) and W'_{d×V} = [w'_1 w'_2 ... w'_V] (output representation).

12 Word2Vec Model Setup (Mikolov et al., 2013). Thus, the model becomes: P(v^(output) | v^(input)) = exp(w'_output^T w_input) / Σ_{j=1}^{V} exp(w'_j^T w_input). W / W' is called the input/output representation. Note that W ≠ W': if W = W', P(·|·) would be maximized when the context word equals the input word, which is a rare event. If the output (context) word appears in the window, w'_output^T w_input increases.
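A minimal sketch (not from the slides) of this softmax model with hypothetical toy sizes and random parameters; it evaluates P(e_j | e_i) for every output word j.

```python
import numpy as np

V, d = 10, 4                            # assumed toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.uniform(-0.5, 0.5, (d, V))      # input representation (columns are w_i)
W_out = rng.uniform(-0.5, 0.5, (d, V))  # output representation (columns are w'_j)

def softmax_probs(i):
    """P(e_j | e_i) for every output word j, given input word index i."""
    scores = W_out.T @ W[:, i]              # u_ij = w'_j . w_i for all j
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

print(softmax_probs(3).sum())               # 1.0: a proper distribution over the vocabulary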

13 Word2Vec Training the Model (Rong, 2014). Initialize W, W'; read a (context, input) pair; update W'; update W; read another (context, input) pair; ... Initialization: W_ij ~ U[-0.5, 0.5] for all i, j. Suppose v^(output) = e_o appeared in the context of v^(input) = e_i. Update W' by minimizing the negative log-likelihood: L ≡ -log P(e_o | e_i) = log(Σ_{j=1}^{V} exp(u_ij)) - u_io, where u_ij = w'_j^T w_i, j = 1, ..., V.

14 Word2Vec Training the Model (Rong, 2014). Taking derivatives: ∂L/∂u_ik = exp(u_ik) / Σ_{j=1}^{V} exp(u_ij) - δ_(k=o) = P(e_k | e_i) - δ_(k=o), and ∂u_ik/∂w'_k = w_i, so ∂L/∂w'_k = [P(e_k | e_i) - δ_(k=o)] w_i, k = 1, ..., V. With gradient descent, the updating equation is: w'_k(new) = w'_k(old) - α [P(e_k | e_i) - δ_(k=o)] w_i, k = 1, ..., V. If k = o, [P(e_k | e_i) - δ_(k=o)] < 0, i.e. the probability is underestimated, so the update adds a component in the w_i direction to w'_k. In summary, the update increases u_io and decreases u_ik for k ≠ o.

15 Word2Vec Training the Model (Rong, 2014). Given W', update W. Reminder: v^(output) = e_o appeared in the context of v^(input) = e_i. Taking the derivative w.r.t. w_i: ∂L/∂w_i = Σ_{j=1}^{V} (∂L/∂u_ij)(∂u_ij/∂w_i) = Σ_{j=1}^{V} [P(e_j | e_i) - δ_(j=o)] w'_j. Define EH ≡ Σ_{j=1}^{V} [P(e_j | e_i) - δ_(j=o)] w'_j: the sum of output vectors, weighted by their prediction errors. With gradient descent, the updating equation is: w_i(new) = w_i(old) - α EH.
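A minimal sketch of one SGD step for a single (input, output) pair, following the two update equations above; the toy sizes, seed and learning rate are assumptions.

```python
import numpy as np

V, d, alpha = 10, 4, 0.05
rng = np.random.default_rng(1)
W = rng.uniform(-0.5, 0.5, (d, V))      # input vectors w_i (columns)
W_out = rng.uniform(-0.5, 0.5, (d, V))  # output vectors w'_j (columns)

def sgd_step(i, o):
    """Update for input word index i and observed output (context) word index o."""
    u = W_out.T @ W[:, i]                       # scores u_ij
    p = np.exp(u - u.max()); p /= p.sum()       # P(e_j | e_i)
    err = p.copy(); err[o] -= 1.0               # prediction error P(e_j|e_i) - delta_(j=o)
    EH = W_out @ err                            # error-weighted sum of output vectors
    W_out[:] -= alpha * np.outer(W[:, i], err)  # w'_j <- w'_j - alpha * err_j * w_i
    W[:, i] -= alpha * EH                       # w_i  <- w_i  - alpha * EH

sgd_step(i=3, o=7)
```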

16 Word2Vec CBOW and Skip-gram (Mikolov et al., 2013). Figure: CBOW Model. Figure: Skip-gram Model.

17 Word2Vec Training CBOW (Rong, 2014). Input: v^(t-c), ..., v^(t-1), v^(t+1), ..., v^(t+c). Output: v^(t). Suppose v^(t-c) = e_t(1), ..., v^(t+c) = e_t(2c), and v^(t) = e_o. Define h_t ≡ (1/2c) Σ_{j=±1,...,±c} W v^(t+j) = (1/2c) Σ_{k=1}^{2c} w_t(k). Then the model becomes: P(e_o | e_t(1), ..., e_t(2c)) = exp(w'_o^T h_t) / Σ_{j=1}^{V} exp(w'_j^T h_t). The loss is defined by the negative log-likelihood: L ≡ log Σ_{j=1}^{V} exp(w'_j^T h_t) - w'_o^T h_t, where u_tj = w'_j^T h_t, j = 1, ..., V.

18 Word2Vec Training CBOW (Rong, 2014). With a similar calculation, the updating equation for W' becomes: w'_k(new) = w'_k(old) - α [P(e_k | v^(t-c), ..., v^(t+c)) - δ_(k=o)] h_t. For W, note that u_tj = w'_j^T h_t = (1/2c) Σ_{k=1}^{2c} w'_j^T w_t(k). Backpropagating: ∂L/∂w_t(k) = Σ_{j=1}^{V} (∂L/∂u_tj)(∂u_tj/∂w_t(k)) = (1/2c) Σ_{j=1}^{V} [P(e_j | v^(t-c), ..., v^(t+c)) - δ_(j=o)] w'_j = (1/2c) EH. Thus the updating equation becomes: w_t(k)(new) = w_t(k)(old) - α (1/2c) EH, k = 1, ..., 2c.
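A minimal CBOW sketch for one window, assuming toy parameters; h is the average of the 2c context word vectors, and the updates follow the equations above (each context word receives EH/2c).

```python
import numpy as np

V, d, alpha = 10, 4, 0.05
rng = np.random.default_rng(2)
W = rng.uniform(-0.5, 0.5, (d, V))      # input vectors
W_out = rng.uniform(-0.5, 0.5, (d, V))  # output vectors

def cbow_step(context_idx, target):
    h = W[:, context_idx].mean(axis=1)         # h_t = (1/2c) sum_k w_t(k)
    u = W_out.T @ h
    p = np.exp(u - u.max()); p /= p.sum()      # P(e_j | context)
    err = p.copy(); err[target] -= 1.0         # P(e_j | context) - delta_(j=o)
    EH = W_out @ err
    W_out[:] -= alpha * np.outer(h, err)                          # update output vectors
    W[:, context_idx] -= alpha * EH[:, None] / len(context_idx)   # spread EH over context words

cbow_step(context_idx=[1, 2, 4, 5], target=3)
```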

19 Word2Vec Training Skip-gram (Rong, 2014). Input: v^(t). Output: v^(t-c), ..., v^(t-1), v^(t+1), ..., v^(t+c). Suppose v^(t) = e_i, and v^(t-c) = e_t(1), ..., v^(t+c) = e_t(2c). Then the model becomes: P(v^(t-c), ..., v^(t-1), v^(t+1), ..., v^(t+c) | v^(t)) = Π_{k=1}^{2c} P(e_t(k) | e_i) = Π_{k=1}^{2c} exp(w'_t(k)^T w_i) / Σ_{j=1}^{V} exp(w'_j^T w_i). The loss becomes: L ≡ Σ_{k=1}^{2c} L_k = Σ_{k=1}^{2c} [log Σ_{j=1}^{V} exp(u^(k)_ij) - u^(k)_it(k)], where u^(k)_ij = w'_j^T w_i is the score appearing in the k-th loss term only.

20 Word2Vec Training Skip-gram (Rong, 2014). For W', j = 1, ..., V: ∂L/∂w'_j = Σ_{k=1}^{2c} (∂L_k/∂u^(k)_ij)(∂u^(k)_ij/∂w'_j) = Σ_{k=1}^{2c} [P(e_j | e_i) - δ_(j=t(k))] w_i. Thus, the updating equation for W' becomes: w'_j(new) = w'_j(old) - α Σ_{k=1}^{2c} [P(e_j | e_i) - δ_(j=t(k))] w_i, j = 1, ..., V.

21 Word2Vec Training Skip-gram (Rong, 2014). For W: ∂L/∂w_i = Σ_{k=1}^{2c} Σ_{j=1}^{V} (∂L_k/∂u^(k)_ij)(∂u^(k)_ij/∂w_i) = Σ_{k=1}^{2c} Σ_{j=1}^{V} [P(e_j | e_i) - δ_(j=t(k))] w'_j ≡ Σ_{k=1}^{2c} EH^(k). Thus, the updating equation for W becomes: w_i(new) = w_i(old) - α Σ_{k=1}^{2c} EH^(k).
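A minimal skip-gram sketch for one center word and its 2c context words, assuming toy parameters; the gradients accumulate over the 2c per-context losses exactly as above.

```python
import numpy as np

V, d, alpha = 10, 4, 0.05
rng = np.random.default_rng(3)
W = rng.uniform(-0.5, 0.5, (d, V))      # input vectors
W_out = rng.uniform(-0.5, 0.5, (d, V))  # output vectors

def skipgram_step(center, context_idx):
    u = W_out.T @ W[:, center]
    p = np.exp(u - u.max()); p /= p.sum()        # P(e_j | e_i), shared by all 2c loss terms
    grad_u = np.zeros(V)
    for t in context_idx:                        # sum_k [P(e_j|e_i) - delta_(j=t(k))]
        err = p.copy(); err[t] -= 1.0
        grad_u += err
    EH = W_out @ grad_u                          # sum_k EH^(k)
    W_out[:] -= alpha * np.outer(W[:, center], grad_u)
    W[:, center] -= alpha * EH

skipgram_step(center=3, context_idx=[1, 2, 4, 5])
```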

22 Word2Vec Computational Problem. For each (input, output) pair in the corpus C = (v^(1), v^(2), ..., v^(N)), the model must calculate: P(e_o | e_i) = exp(w'_o^T w_i) / Σ_{j=1}^{V} exp(w'_j^T w_i). For each epoch, this means roughly N × V inner products of d-dim vectors (skip-gram: N × V × 2c); the calculation is proportional to V. (Mikolov et al., 2013) suggests two alternative formulations: hierarchical softmax and negative sampling.

23 Word2Vec Hierarchical Softmax (Mikolov et al., 2013). An efficient way of computing the softmax: build a Huffman binary tree using word frequencies. Instead of w'_j, the model uses w'_n(e_j, l), where n(e_j, l) is the l-th node on the path from the root to the word e_j. Figure: Binary tree for HS. Let h_i be the hidden vector. Then the probability model becomes: P(e_o | e_i) = Π_{l=1}^{L(e_o)-1} σ([[n(e_o, l+1) is the left child of n(e_o, l)]] · w'_n(e_o, l)^T h_i), where [[x]] = 1 if x is true and -1 otherwise, and L(e_o) is the length of the path to e_o.

24 Word2Vec Training Hierarchical Softmax (Rong, 2014). Let L = -log P(e_o | e_i), and write w'_l ≡ w'_n(e_o, l). Then: ∂L/∂(w'_l^T h_i) = {σ([[·]] w'_l^T h_i) - 1}[[·]] = σ(w'_l^T h_i) - δ_[[·]], where σ(w'_l^T h_i) is the probability that [n(e_o, l+1) is the left child of n(e_o, l)] and δ_[[·]] = 1 if [[·]] = 1, 0 if [[·]] = -1. Thus ∂L/∂w'_l = (P[n(e_o, l+1) is the left child of n(e_o, l)] - δ_[[·]]) h_i, and the updating equation becomes, for l = 1, ..., L(e_o)-1: w'_l(new) = w'_l(old) - α (P[n(e_o, l+1) is the left child of n(e_o, l)] - δ_[[·]]) h_i. For the skip-gram model, repeat this procedure for the 2c outputs. The updating equation for W becomes: w_i(new) = w_i(old) - α EH, where EH = Σ_{l=1}^{L(e_o)-1} (P[·] - δ_[[·]]) w'_l.
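A minimal hierarchical-softmax sketch; the tiny binary tree is an assumption here, with each word given a path of (inner-node index, sign) pairs where sign = +1 means "go to the left child". It only evaluates the probability model from the previous slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4
rng = np.random.default_rng(4)
node_vecs = rng.uniform(-0.5, 0.5, (3, d))    # w'_n for the inner nodes of a tiny tree

# Hypothetical root-to-leaf paths for two words.
paths = {"ice": [(0, +1), (1, -1)],           # root -> left child, then right child
         "gas": [(0, -1), (2, +1)]}

def hs_prob(word, h):
    """P(word | input) = prod_l sigma(sign_l * w'_{n(word,l)} . h)."""
    p = 1.0
    for node, sign in paths[word]:
        p *= sigmoid(sign * node_vecs[node] @ h)
    return p

h = rng.uniform(-0.5, 0.5, d)                 # hidden vector (w_i for skip-gram)
print(hs_prob("ice", h), hs_prob("gas", h))
```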

25 Word2Vec Negative Sampling (Mikolov et al., 2013). Generate e_n(1), ..., e_n(k) from a noise distribution P_n. The goal is to discriminate (h_i, e_o) from (h_i, e_n(1)), ..., (h_i, e_n(k)). For the skip-gram model, repeat this procedure for each of the 2c outputs. k = 5-20 is useful for small training sets; for large datasets, k can be as small as 2-5. The noise distribution P_n(e_n) ∝ [#(e_n)/N]^(3/4) significantly outperformed the unigram and uniform alternatives. Figure: 5-Negative Sampling.

26 Word2Vec Objective in Negative Sampling (Goldberg and Levy, 2014). Suppose (h_i, e_o) and (h_i, e_n(1)), ..., (h_i, e_n(k)) are given (o ≠ n(j), j = 1, ..., k). Let [D = 1 | h_i, e_j] be the event that the pair (h_i, e_j) came from the original corpus. The model assumes: P(D = 1 | h_i, e_j) = σ(w'_j^T h_i). Thus the likelihood becomes: σ(w'_o^T h_i) Π_{j=1}^{k} [1 - σ(w'_n(j)^T h_i)]. Taking the log leads to the objective in (Mikolov et al., 2013): log σ(w'_o^T h_i) + Σ_{j=1}^{k} log σ(-w'_n(j)^T h_i), with e_n(j) ~ P_n. Note that training h_i given w'_o, w'_n(1), ..., w'_n(k) is a logistic regression.

27 Word2Vec Training Negative Sampling (Rong, 2014). Define the loss as: L = -log σ(w'_o^T h_i) - Σ_{j=1}^{k} log σ(-w'_n(j)^T h_i). Let W_neg = {w'_n(1), ..., w'_n(k)}. Then the derivative: ∂L/∂(w'_j^T h_i) = σ(w'_j^T h_i) - 1 if w'_j = w'_o, and σ(w'_j^T h_i) if w'_j ∈ W_neg; that is, ∂L/∂(w'_j^T h_i) = P(D = 1 | h_i, e_j) - δ_(j=o). Thus the updating equation for W': for j = o, n(1), ..., n(k), w'_j(new) = w'_j(old) - α [P(D = 1 | h_i, e_j) - δ_(j=o)] h_i. Let ∂L/∂h_i = Σ_{j ∈ {o, n(1), ..., n(k)}} (P(D = 1 | h_i, e_j) - δ_(j=o)) w'_j ≡ EH. Then the updating equation for W: w_i(new) = w_i(old) - α EH (∂h_i/∂w_i); for skip-gram, h_i = w_i.
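A minimal negative-sampling sketch for one (input, output) pair plus k noise words, assuming toy parameters and a unigram^(3/4) noise distribution; for simplicity the positive word is excluded from the noise draw.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d, k, alpha = 10, 4, 3, 0.05
rng = np.random.default_rng(5)
W = rng.uniform(-0.5, 0.5, (d, V))              # input vectors w_i
W_out = rng.uniform(-0.5, 0.5, (d, V))          # output vectors w'_j
counts = rng.integers(1, 100, V).astype(float)  # hypothetical word counts #(e_n)
P_n = counts ** 0.75 / (counts ** 0.75).sum()   # noise distribution prop. to count^(3/4)

def neg_sampling_step(i, o):
    h = W[:, i]                                  # hidden vector (skip-gram: h = w_i)
    probs = P_n.copy(); probs[o] = 0.0; probs /= probs.sum()   # exclude the positive word
    neg = rng.choice(V, size=k, replace=False, p=probs)        # e_n(1), ..., e_n(k)
    idx = np.concatenate(([o], neg))
    labels = np.zeros(k + 1); labels[0] = 1.0    # delta_(j=o)
    pred = sigmoid(W_out[:, idx].T @ h)          # P(D = 1 | h, e_j)
    err = pred - labels                          # prediction error
    EH = W_out[:, idx] @ err
    W_out[:, idx] -= alpha * np.outer(h, err)    # update w'_o and the k negative vectors
    W[:, i] -= alpha * EH                        # update the input vector

neg_sampling_step(i=3, o=7)
```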

28 Word2Vec Two Pre-processing Techniques (Mikolov et al., 2013). Frequent words (such as "a", "the", "in") provide less information value than rare words. Let V = {v_1, ..., v_V} be the vocabulary set. Discard each occurrence of word v_i with probability: P(v_i) = 1 - sqrt(t / [#(v_i)/N]), where t = 10^(-5) is a suitable threshold value. Phrases such as "New York Times" or "Toronto Maple Leafs" can be considered as one word. In order to find such phrases, define a score: score(v_i, v_j) = [#(v_i v_j) - δ] / [#(v_i) #(v_j)]. Over 2-4 passes of the training set, calculate the score with decreasing δ; above some threshold value, treat v_i v_j as a single word. A subsampling sketch follows below.
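A minimal sketch of the subsampling rule on a toy corpus; each occurrence of word v is discarded with probability 1 - sqrt(t / (#(v)/N)). The threshold here is exaggerated so the effect is visible on a few words (the paper uses t = 10^-5 for large corpora).

```python
import numpy as np
from collections import Counter

corpus = "the cat sat on the mat the cat likes the mat".split()
N = len(corpus)
freq = {w: c / N for w, c in Counter(corpus).items()}   # #(v)/N
t = 0.1                                                 # exaggerated threshold for a toy corpus

rng = np.random.default_rng(6)
def keep(word):
    p_discard = max(0.0, 1.0 - np.sqrt(t / freq[word]))
    return rng.random() >= p_discard

subsampled = [w for w in corpus if keep(w)]
print(subsampled)   # occurrences of frequent words like "the" are often dropped
```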

29 GloVe

30 GloVe Motivation (Pennington et al., 2014). Let V = {v_1, ..., v_V} be the vocabulary set. Throughout the corpus C, define some statistics: X_ij, the number of times word v_j appears in the context of word v_i; X_i ≡ Σ_k X_ik, the number of times any word appears in the context of v_i; P_ij = X_ij / X_i, the probability that v_j appears in the context of v_i. How can we measure similarity between words, sim(v_i, v_j)?

31 GloVe Motivation (Pennington et al., 2014). Co-occurrence probabilities for "ice" and "steam" with selected context words from a corpus (N = 6 billion). If v_k is related to v_i rather than v_j, then P_ik / P_jk will be larger than 1. If v_k is related (or not related) to both v_i and v_j, then P_ik / P_jk will be close to 1. The ratio P_ik / P_jk is useful for finding out whether v_k is close to v_i (or to v_j). Figure: from (Pennington et al., 2014).

32 GloVe Model Setup (Pennington et al., 2014). With this motivation, the model becomes: P_ik / P_jk = F(w_i, w_j, w'_k), with w_i, w_j, w'_k ∈ R^d. Using two kinds of parameters W, W' can help reduce overfitting and noise and generally improves results (Ciresan et al., 2012). In vector space, knowing w_1, ..., w_V is the same as knowing the differences w_1 - w_i, ..., w_V - w_i, so F can be restricted to: P_ik / P_jk = F(w_i - w_j, w'_k). In order to match dimensions and preserve the linear structure, use dot products: P_ik / P_jk = F[(w_i - w_j)^T w'_k].

33 GloVe Model Setup (Pennington et al., 2014). For any i, j, k, l = 1, ..., V: F[(w_i - w_j)^T w'_k] F[(w_j - w_l)^T w'_k] = F[(w_i - w_l)^T w'_k] = P_ik / P_lk. It is therefore natural to require F to satisfy F(x)F(y) = F(x + y), which implies F = exp(·). Moreover: F[(w_i - w_j)^T w'_k] = exp(w_i^T w'_k) / exp(w_j^T w'_k) = P_ik / P_jk. Thus w_i^T w'_k = log P_ik = log X_ik - log X_i. Since the roles of a word and a context word are exchangeable, the model should also be symmetric in w_i and w'_k.

34 GloVe Model Setup (Pennington et al., 2014). Absorb log X_i into a bias b_i of the input representation and add another bias b'_k for the context representation. Finally, the model becomes: w_i^T w'_k + b_i + b'_k = log X_ik. Now, define a weighted cost function: L = Σ_{i,j=1}^{V} f(X_ij) (w_i^T w'_j + b_i + b'_j - log X_ij)^2. The weight f must satisfy: f(0) = 0, to handle the case X_ij = 0; f must be non-decreasing, so that frequent co-occurrences are emphasized; and f should be relatively small for large values, to avoid overweighting words like "in", "the", and "and".

35 GloVe Training GloVe (Pennington et al., 2014). The suggested weight is: f(x) = (x/x_max)^α if x < x_max, and 1 if x ≥ x_max. x_max is reported to have a weak impact on performance (fixed at x_max = 100); α = 3/4 gives a modest improvement over α = 1. Training uses AdaGrad (Duchi et al., 2011), stochastically sampling the non-zero elements of X. The model generates W and W'; the final word vectors are W + W'.
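A minimal sketch of the GloVe weighted least-squares objective, assuming a hypothetical co-occurrence matrix X and random parameters; it only evaluates the loss with the weighting function f above, not the AdaGrad training loop.

```python
import numpy as np

V, d, x_max, a = 5, 3, 100.0, 0.75
rng = np.random.default_rng(7)
X = rng.integers(0, 50, (V, V)).astype(float)    # hypothetical co-occurrence counts

W = rng.normal(scale=0.1, size=(V, d))           # word vectors w_i
W_ctx = rng.normal(scale=0.1, size=(V, d))       # context vectors w'_j
b, b_ctx = np.zeros(V), np.zeros(V)              # biases b_i, b'_j

def f(x):
    return np.where(x < x_max, (x / x_max) ** a, 1.0)

def glove_loss():
    total = 0.0
    for i in range(V):
        for j in range(V):
            if X[i, j] == 0:                     # f(0) = 0: zero counts contribute nothing
                continue
            resid = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
            total += f(X[i, j]) * resid ** 2
    return total

print(glove_loss())
```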

36 Toy Implementation

37 Toy Implementation Data and Model Descriptions. Movie review data from the NLTK corpus, consisting of plot summaries and critiques. Corpus size N = 1.5 million, vocabulary size V = . Embedding dimension d = 100, window size c = 5, negative sample size k = 5. GloVe trained with 10 epochs. Time elapsed for training (Intel Core i7, 3.60 GHz): CBOW+HS 9.14s, CBOW+NEG 4.53s, SG+HS 12.4s, SG+NEG 12.3s, GloVe 44.2s.

38 Toy Implementation Results: similarity between two vectors.

39 Toy Implementation Results: similarity between two vectors (most frequent words).

40 Toy Implementation Results: top 5 similar words with "villian".

41 Toy Implementation Results: linear relationship (actor + she - actress = ?).

42 Toy Implementation Results: linear relationship (king + she - he = ?).

43 Performances

44 Performances Intrinsic Performances (Pennington et al., 2014). Word analogy task: 19,544 questions. Semantic: Athens is to Greece as Berlin is to (?). Syntactic: dance is to dancing as fly is to (?). Corpus: Gigaword5 + Wikipedia2014. Percentage of correct answers (Sem., Syn., Tot.) for CBOW (d = 300, N = 6B), SG (d = 300, N = 6B) and GloVe (d = 300, N = 6B). Table: from (Pennington et al., 2014).

45 Performances Extrinsic Performances (Pennington et al., 2014). Named entity recognition (NER) with a Conditional Random Field (CRF) model. Input: "Jim bought 300 shares of Acme Corp. in 2006". Output: "[Jim](Person) bought 300 shares of [Acme Corp.](Organization) in ". Entities: person, location, organization, miscellaneous.

46 Performances Extrinsic Performances (Pennington et al., 2014). Trained with the CoNLL-03 training set and 50-dimensional word vectors. F1 score on the validation set and 3 kinds of test sets; columns: Model, Validation, CoNLL-Test, ACE, MUC7; rows: Discrete, CBOW, SG (None, None, None, None), GloVe. Table: from (Pennington et al., 2014).

47 Word Embedding + RNN

48 Word Embedding + RNN How to Add Embedded Vectors to RNN. Recall the RNN model: Input: x_t. Hidden unit: h_t = tanh(b + U_h h_{t-1} + U_i x_t). Output unit: o_t = c + U_o h_t. Predicted probability: p_t = softmax(o_t). Unknown parameters: (U_i, U_o, U_h, b, c).

49 Word Embedding + RNN How to Add Embedded Vectors to RNN. With word embeddings: Input: w_i(t) = W x_t. Hidden unit: h_t = tanh(b + U_h h_{t-1} + U_i w_i(t)). Output unit: o_t = c + U_o h_t. Predicted probability: p_t = softmax(o_t). Unknown parameters: (W, U_i, U_o, U_h, b, c). W is not just an input; it is the initial weight matrix of the word vectors, so the word vectors are fine-tuned for the specific goal. Another derivative is added: for k = 1, ..., V, ∂L/∂w_k = Σ_{t: i(t)=k} (∂L/∂o_t)(∂o_t/∂h_t)(∂h_t/∂w_k). This can be generalized to LSTM and GRU.
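A minimal sketch of plugging pretrained word vectors into a simple RNN, with assumed toy shapes; W is initialized from the embeddings and would then be fine-tuned along with the other parameters.

```python
import numpy as np

V, d, H = 10, 4, 6                            # assumed vocab size, embedding dim, hidden size
rng = np.random.default_rng(8)
pretrained = rng.uniform(-0.5, 0.5, (d, V))   # stand-in for word2vec / GloVe vectors

W = pretrained.copy()                         # trainable, initialized from the embeddings
U_i = rng.normal(scale=0.1, size=(H, d))
U_h = rng.normal(scale=0.1, size=(H, H))
U_o = rng.normal(scale=0.1, size=(V, H))
b, c = np.zeros(H), np.zeros(V)

def forward(word_indices):
    h = np.zeros(H)
    probs = []
    for idx in word_indices:
        x = W[:, idx]                           # w_i(t) = W x_t (x_t one-hot)
        h = np.tanh(b + U_h @ h + U_i @ x)      # hidden unit
        o = c + U_o @ h                         # output unit
        p = np.exp(o - o.max()); p /= p.sum()   # softmax
        probs.append(p)
    return probs

print(forward([1, 4, 2])[-1].sum())             # 1.0: a distribution over the next word
```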

50 Word Embedding + RNN Word-rnn (Eidnes, 2015). Goal: generating clickbait headlines. Trained on 2M clickbait headlines scraped from Buzzfeed, Gawker, Jezebel, Huffington Post and Upworthy. RNN model using GloVe word vectors (N = 6B, d = 200) as initial weights. 3-layer LSTM model with T = .

51 Word Embedding + RNN Word-rnn (Eidnes, 2015). First 8 completions of "Barack Obama Says": Barack Obama Says It's Wrong To Talk About Iraq; Barack Obama Says He's Like A Single Mother And Over The Top; Barack Obama Says He Did 48 Things Over; Barack Obama Says About Ohio Law; Barack Obama Says He Is Wrong; Barack Obama Says He Will Get The American Idol; Barack Obama Says Himself Are Doing Well Around The World; Barack Obama Says As He Leaves Politics With His Wife. More examples are on the website given in the references. Most of the generated sentences are grammatically correct and make sense.

52 Word Embedding + RNN Word-rnn (Eidnes, 2015). The model seems to understand gender and political context: "Mary J. Williams On Coming Out As A Woman"; "Romney Camp: I Think You Are A Bad President". Updating W for only 2 layers works best. Figure: from (Eidnes, 2015).

53 Conclusion

54 Conclusion Summary. Embedding discrete words into R^d yields interesting results: similar words have vectors with high cosine similarity, and linear relationships hold (king + she - he = ?). Embedded vectors can be used as inputs or as initial weights of a deep neural network.

55 References

56 References Key References. Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

57 References Key References. Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738. Eidnes, L. (2015). Auto-Generating Clickbait With Recurrent Neural Networks. Lars Eidnes' blog [Accessed 8 May 2018].
