Deep Learning: A Statistical Perspective


1 Introduction Deep Learning: A Statistical Perspective. Myunghee Cho Paik. Guest lectures by Gisoo Kim, Yongchan Kwon, Young-geun Kim, Wonyoung Kim and Youngwon Choi. Seoul National University, March-June, 2018.

2 Introduction

3 Introduction Natural Language Processing. Natural Language Processing (NLP) includes: sentiment analysis, machine translation, text generation, ... How do we train models on language? How can we convert language into numbers?

4 Introduction Word Embedding. How can we map words into R^d? One-hot encoding: each vector has nothing to do with the other vectors; for one-hot vectors u ≠ v, u^T u = v^T v = 1 and u^T v = 0. However, a word is characterized by the company it keeps: "ice" is closer to "solid" than to "gas".
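A minimal sketch (not from the slides) of why one-hot encoding carries no similarity information: every pair of distinct one-hot vectors is orthogonal. The toy vocabulary is an assumption.

```python
import numpy as np

vocab = ["ice", "solid", "gas"]        # hypothetical toy vocabulary
one_hot = np.eye(len(vocab))           # row i is the one-hot vector of vocab[i]

ice, solid = one_hot[0], one_hot[1]
print(ice @ solid)                     # 0.0: "ice" looks no closer to "solid" than to "gas"
print(ice @ ice)                       # 1.0: each vector only matches itself
```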

5 Introduction Main Questions in Word Embedding. Vocabulary set: V = {a, the, deep, statistics, ...}. Size-N corpus: C = (v^(1), v^(2), ..., v^(N)), v^(1), v^(2), ..., v^(N) ∈ V. Given the corpus data, how can we measure similarity between words, e.g. sim(deep, statistics)? How can we define f and learn w_deep, w_statistics such that sim(deep, statistics) = f(w_deep, w_statistics)?

6 Introduction Some Famous Word Embedding Techniques: Latent Semantic Analysis (LSA) (Deerwester et al., 1990); Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996); Word2Vec (Mikolov et al., 2013a); GloVe (Pennington et al., 2014).

7 Introduction LSA (Deerwester et al., 1990). Term-document matrix: X_{t×d}. sim(a, b) ∝ co-occurrence within documents. Singular Value Decomposition: X_{t×d} = T S D^T. Keeping the k largest singular values: X̂_{t×d} = T_{t×k} S_{k×k} (D_{d×k})^T, where T_{t×k} gives k-dim term vectors and D_{d×k} gives k-dim document vectors. Figure: from (Deerwester et al., 1990).
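A minimal LSA sketch, assuming a small hypothetical term-document count matrix X (t × d); truncating the SVD to the k largest singular values gives k-dimensional term vectors whose cosine similarity reflects shared documents.

```python
import numpy as np

X = np.array([[2., 0., 1.],   # term "deep"
              [1., 0., 2.],   # term "learning"
              [0., 3., 0.]])  # term "baseball"
k = 2

T, s, Dt = np.linalg.svd(X, full_matrices=False)
term_vecs = T[:, :k] * s[:k]      # k-dim term representations (T_{t x k} S_{k x k})
doc_vecs = Dt[:k, :].T            # k-dim document representations

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(term_vecs[0], term_vecs[1]))  # "deep" vs "learning": high (shared documents)
print(cos(term_vecs[0], term_vecs[2]))  # "deep" vs "baseball": low
```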

8 Introduction HAL (Lund and Burgess, 1996). Term-context-term matrix X_{V×V}: how many times does the column word appear in front of the row term? sim(a, b) ∝ co-occurrence in nearby context. Concatenate the row and column of each term to make a 2V-dim vector, then reduce dimension with the k principal components. Trained on 160M terms, with V = 70,000. Figure: from (Lund and Burgess, 1996).

9 Introduction Sim(·,·) ∝ Co-occurrence? Co-occurrence with "and" or "the" does not mean semantic similarity: does a pair merely appear frequently, or is the similarity significant? This motivates transforming the counts or defining a new measure of similarity: entropy/correlation-based normalization (Rohde et al., 2006); positive pointwise mutual information (PPMI), max{0, log [p(context | term) / p(context)]} (Bullinaria and Levy, 2007); square-root-type transformation (Lebret and Collobert, 2014); modeling p(context | term) within every local window (Word2Vec).
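A minimal PPMI sketch, assuming a hypothetical term-context count matrix X (V × V); it applies PPMI(term, context) = max{0, log p(context | term) / p(context)} entrywise.

```python
import numpy as np

X = np.array([[0., 4., 1.],
              [4., 0., 2.],
              [1., 2., 0.]])                              # toy co-occurrence counts

p_context_given_term = X / X.sum(axis=1, keepdims=True)   # row-normalize: p(context | term)
p_context = X.sum(axis=0) / X.sum()                       # marginal p(context)

with np.errstate(divide="ignore"):
    pmi = np.log(p_context_given_term / p_context)
ppmi = np.maximum(pmi, 0.0)                               # clip negatives (and -inf) to 0
print(ppmi.round(2))
```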

10 Word2Vec

11 Word2Vec Model Setup (Mikolov et al., 2013). Vocabulary set: V = {e_1, e_2, ..., e_V} ⊂ {0,1}^V (one-hot vectors). Size-N corpus: C = (v^(1), v^(2), ..., v^(N)), v^(1), ..., v^(N) ∈ V. Embedded word vectors: W_{d×V} = [w_1 w_2 ... w_V] (input representation) and W'_{d×V} = [w'_1 w'_2 ... w'_V] (output representation).

12 Word2Vec Model Setup (Mikolov et al., 2013). Thus, the model becomes: P(v^(output) | v^(input)) = exp(w'_output^T w_input) / Σ_{j=1}^{V} exp(w'_j^T w_input). W / W' is called the input/output representation. Note that W ≠ W': if W = W', P(·|·) would be maximized when the context word equals the input word, which is a rare event. If the output (context) word appears in the window, w'_output^T w_input increases.
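A minimal sketch (not from the slides) of this softmax model with hypothetical toy sizes and random parameters; it evaluates P(e_j | e_i) for every output word j.

```python
import numpy as np

V, d = 10, 4                            # assumed toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.uniform(-0.5, 0.5, (d, V))      # input representation (columns are w_i)
W_out = rng.uniform(-0.5, 0.5, (d, V))  # output representation (columns are w'_j)

def softmax_probs(i):
    """P(e_j | e_i) for every output word j, given input word index i."""
    scores = W_out.T @ W[:, i]              # u_ij = w'_j . w_i for all j
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

print(softmax_probs(3).sum())               # 1.0: a proper distribution over the vocabulary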

13 Word2Vec Training the Model (Rong, 2014). Initialize W, W'; read a (context, input) pair; update W'; update W; read another (context, input) pair; ... Initialization: W_ij ~ U[-0.5, 0.5] for all i, j. Suppose v^(output) = e_o appeared in the context of v^(input) = e_i. Update W' by minimizing the negative log-likelihood: L ≡ -log P(e_o | e_i) = log(Σ_{j=1}^{V} exp(u_ij)) - u_io, where u_ij = w'_j^T w_i, j = 1, ..., V.

14 Word2Vec Training the Model (Rong, 2014). Taking derivatives: ∂L/∂u_ik = exp(u_ik) / Σ_{j=1}^{V} exp(u_ij) - δ_(k=o) = P(e_k | e_i) - δ_(k=o), and ∂u_ik/∂w'_k = w_i, so ∂L/∂w'_k = [P(e_k | e_i) - δ_(k=o)] w_i, k = 1, ..., V. With gradient descent, the updating equation is: w'_k(new) = w'_k(old) - α [P(e_k | e_i) - δ_(k=o)] w_i, k = 1, ..., V. If k = o, [P(e_k | e_i) - δ_(k=o)] < 0, i.e. the probability is underestimated, so the update adds a component in the w_i direction to w'_k. In summary, the update increases u_io and decreases u_ik for k ≠ o.

15 Word2Vec Training the Model (Rong, 2014). Given W', update W. Reminder: v^(output) = e_o appeared in the context of v^(input) = e_i. Taking the derivative w.r.t. w_i: ∂L/∂w_i = Σ_{j=1}^{V} (∂L/∂u_ij)(∂u_ij/∂w_i) = Σ_{j=1}^{V} [P(e_j | e_i) - δ_(j=o)] w'_j. Define EH ≡ Σ_{j=1}^{V} [P(e_j | e_i) - δ_(j=o)] w'_j: the sum of output vectors, weighted by their prediction errors. With gradient descent, the updating equation is: w_i(new) = w_i(old) - α EH.
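A minimal sketch of one SGD step for a single (input, output) pair, following the two update equations above; the toy sizes, seed and learning rate are assumptions.

```python
import numpy as np

V, d, alpha = 10, 4, 0.05
rng = np.random.default_rng(1)
W = rng.uniform(-0.5, 0.5, (d, V))      # input vectors w_i (columns)
W_out = rng.uniform(-0.5, 0.5, (d, V))  # output vectors w'_j (columns)

def sgd_step(i, o):
    """Update for input word index i and observed output (context) word index o."""
    u = W_out.T @ W[:, i]                       # scores u_ij
    p = np.exp(u - u.max()); p /= p.sum()       # P(e_j | e_i)
    err = p.copy(); err[o] -= 1.0               # prediction error P(e_j|e_i) - delta_(j=o)
    EH = W_out @ err                            # error-weighted sum of output vectors
    W_out[:] -= alpha * np.outer(W[:, i], err)  # w'_j <- w'_j - alpha * err_j * w_i
    W[:, i] -= alpha * EH                       # w_i  <- w_i  - alpha * EH

sgd_step(i=3, o=7)
```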

16 Word2Vec CBOW and Skip-gram (Mikolov et al., 2013). Figure: CBOW Model. Figure: Skip-gram Model.

17 Word2Vec Training CBOW (Rong, 2014). Input: v^(t-c), ..., v^(t-1), v^(t+1), ..., v^(t+c). Output: v^(t). Suppose v^(t-c) = e_t(1), ..., v^(t+c) = e_t(2c), and v^(t) = e_o. Define h_t ≡ (1/2c) Σ_{j=±1,...,±c} W v^(t+j) = (1/2c) Σ_{k=1}^{2c} w_t(k). Then the model becomes: P(e_o | e_t(1), ..., e_t(2c)) = exp(w'_o^T h_t) / Σ_{j=1}^{V} exp(w'_j^T h_t). The loss is defined by the negative log-likelihood: L ≡ log Σ_{j=1}^{V} exp(w'_j^T h_t) - w'_o^T h_t, where u_tj = w'_j^T h_t, j = 1, ..., V.

18 Word2Vec Training CBOW (Rong, 2014). With a similar calculation, the updating equation for W' becomes: w'_k(new) = w'_k(old) - α [P(e_k | v^(t-c), ..., v^(t+c)) - δ_(k=o)] h_t. For W, note that u_tj = w'_j^T h_t = (1/2c) Σ_{k=1}^{2c} w'_j^T w_t(k). Backpropagating: ∂L/∂w_t(k) = Σ_{j=1}^{V} (∂L/∂u_tj)(∂u_tj/∂w_t(k)) = (1/2c) Σ_{j=1}^{V} [P(e_j | v^(t-c), ..., v^(t+c)) - δ_(j=o)] w'_j = (1/2c) EH. Thus the updating equation becomes: w_t(k)(new) = w_t(k)(old) - α (1/2c) EH, k = 1, ..., 2c.
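A minimal CBOW sketch for one window, assuming toy parameters; h is the average of the 2c context word vectors, and the updates follow the equations above (each context word receives EH/2c).

```python
import numpy as np

V, d, alpha = 10, 4, 0.05
rng = np.random.default_rng(2)
W = rng.uniform(-0.5, 0.5, (d, V))      # input vectors
W_out = rng.uniform(-0.5, 0.5, (d, V))  # output vectors

def cbow_step(context_idx, target):
    h = W[:, context_idx].mean(axis=1)         # h_t = (1/2c) sum_k w_t(k)
    u = W_out.T @ h
    p = np.exp(u - u.max()); p /= p.sum()      # P(e_j | context)
    err = p.copy(); err[target] -= 1.0         # P(e_j | context) - delta_(j=o)
    EH = W_out @ err
    W_out[:] -= alpha * np.outer(h, err)                          # update output vectors
    W[:, context_idx] -= alpha * EH[:, None] / len(context_idx)   # spread EH over context words

cbow_step(context_idx=[1, 2, 4, 5], target=3)
```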

19 Word2Vec Training Skip-gram (Rong, 2014). Input: v^(t). Output: v^(t-c), ..., v^(t-1), v^(t+1), ..., v^(t+c). Suppose v^(t) = e_i, and v^(t-c) = e_t(1), ..., v^(t+c) = e_t(2c). Then the model becomes: P(v^(t-c), ..., v^(t-1), v^(t+1), ..., v^(t+c) | v^(t)) = Π_{k=1}^{2c} P(e_t(k) | e_i) = Π_{k=1}^{2c} exp(w'_t(k)^T w_i) / Σ_{j=1}^{V} exp(w'_j^T w_i). The loss becomes: L ≡ Σ_{k=1}^{2c} L_k = Σ_{k=1}^{2c} [log Σ_{j=1}^{V} exp(u^(k)_ij) - u^(k)_it(k)], where u^(k)_ij = w'_j^T w_i is the score appearing in the k-th loss term only.

20 Word2Vec Training Skip-gram (Rong, 2014). For W', j = 1, ..., V: ∂L/∂w'_j = Σ_{k=1}^{2c} (∂L_k/∂u^(k)_ij)(∂u^(k)_ij/∂w'_j) = Σ_{k=1}^{2c} [P(e_j | e_i) - δ_(j=t(k))] w_i. Thus, the updating equation for W' becomes: w'_j(new) = w'_j(old) - α Σ_{k=1}^{2c} [P(e_j | e_i) - δ_(j=t(k))] w_i, j = 1, ..., V.

21 Word2Vec Training Skip-gram (Rong, 2014). For W: ∂L/∂w_i = Σ_{k=1}^{2c} Σ_{j=1}^{V} (∂L_k/∂u^(k)_ij)(∂u^(k)_ij/∂w_i) = Σ_{k=1}^{2c} Σ_{j=1}^{V} [P(e_j | e_i) - δ_(j=t(k))] w'_j ≡ Σ_{k=1}^{2c} EH^(k). Thus, the updating equation for W becomes: w_i(new) = w_i(old) - α Σ_{k=1}^{2c} EH^(k).
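A minimal skip-gram sketch for one center word and its 2c context words, assuming toy parameters; the gradients accumulate over the 2c per-context losses exactly as above.

```python
import numpy as np

V, d, alpha = 10, 4, 0.05
rng = np.random.default_rng(3)
W = rng.uniform(-0.5, 0.5, (d, V))      # input vectors
W_out = rng.uniform(-0.5, 0.5, (d, V))  # output vectors

def skipgram_step(center, context_idx):
    u = W_out.T @ W[:, center]
    p = np.exp(u - u.max()); p /= p.sum()        # P(e_j | e_i), shared by all 2c loss terms
    grad_u = np.zeros(V)
    for t in context_idx:                        # sum_k [P(e_j|e_i) - delta_(j=t(k))]
        err = p.copy(); err[t] -= 1.0
        grad_u += err
    EH = W_out @ grad_u                          # sum_k EH^(k)
    W_out[:] -= alpha * np.outer(W[:, center], grad_u)
    W[:, center] -= alpha * EH

skipgram_step(center=3, context_idx=[1, 2, 4, 5])
```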

22 Word2Vec Computational Problem. For each (input, output) pair in the corpus C = (v^(1), v^(2), ..., v^(N)), the model must calculate: P(e_o | e_i) = exp(w'_o^T w_i) / Σ_{j=1}^{V} exp(w'_j^T w_i). For each epoch, this means roughly N × V inner products of d-dim vectors (skip-gram: N × V × 2c); the calculation is proportional to V. (Mikolov et al., 2013) suggests two alternative formulations: hierarchical softmax and negative sampling.

23 Word2Vec Hierarchical Softmax (Mikolov et al., 2013). An efficient way of computing the softmax: build a Huffman binary tree using word frequencies. Instead of w'_j, the model uses w'_n(e_j, l), where n(e_j, l) is the l-th node on the path from the root to the word e_j. Figure: Binary tree for HS. Let h_i be the hidden vector. Then the probability model becomes: P(e_o | e_i) = Π_{l=1}^{L(e_o)-1} σ([[n(e_o, l+1) is the left child of n(e_o, l)]] · w'_n(e_o, l)^T h_i), where [[x]] = 1 if x is true and -1 otherwise, and L(e_o) is the length of the path to e_o.

24 Word2Vec Training Hierarchical Softmax (Rong, 2014). Let L = -log P(e_o | e_i), and write w'_l ≡ w'_n(e_o, l). Then: ∂L/∂(w'_l^T h_i) = {σ([[·]] w'_l^T h_i) - 1}[[·]] = σ(w'_l^T h_i) - δ_[[·]], where σ(w'_l^T h_i) is the probability that [n(e_o, l+1) is the left child of n(e_o, l)] and δ_[[·]] = 1 if [[·]] = 1, 0 if [[·]] = -1. Thus ∂L/∂w'_l = (P[n(e_o, l+1) is the left child of n(e_o, l)] - δ_[[·]]) h_i, and the updating equation becomes, for l = 1, ..., L(e_o)-1: w'_l(new) = w'_l(old) - α (P[n(e_o, l+1) is the left child of n(e_o, l)] - δ_[[·]]) h_i. For the skip-gram model, repeat this procedure for the 2c outputs. The updating equation for W becomes: w_i(new) = w_i(old) - α EH, where EH = Σ_{l=1}^{L(e_o)-1} (P[·] - δ_[[·]]) w'_l.
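A minimal hierarchical-softmax sketch; the tiny binary tree is an assumption here, with each word given a path of (inner-node index, sign) pairs where sign = +1 means "go to the left child". It only evaluates the probability model from the previous slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4
rng = np.random.default_rng(4)
node_vecs = rng.uniform(-0.5, 0.5, (3, d))    # w'_n for the inner nodes of a tiny tree

# Hypothetical root-to-leaf paths for two words.
paths = {"ice": [(0, +1), (1, -1)],           # root -> left child, then right child
         "gas": [(0, -1), (2, +1)]}

def hs_prob(word, h):
    """P(word | input) = prod_l sigma(sign_l * w'_{n(word,l)} . h)."""
    p = 1.0
    for node, sign in paths[word]:
        p *= sigmoid(sign * node_vecs[node] @ h)
    return p

h = rng.uniform(-0.5, 0.5, d)                 # hidden vector (w_i for skip-gram)
print(hs_prob("ice", h), hs_prob("gas", h))
```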

25 Word2Vec Negative Sampling (Mikolov et al., 2013). Generate e_n(1), ..., e_n(k) from a noise distribution P_n. The goal is to discriminate (h_i, e_o) from (h_i, e_n(1)), ..., (h_i, e_n(k)). For the skip-gram model, repeat this procedure for each of the 2c outputs. k = 5-20 is useful for small training sets; for large datasets, k can be as small as 2-5. The noise distribution P_n(e_n) ∝ [#(e_n)/N]^(3/4) significantly outperformed the unigram and uniform alternatives. Figure: 5-Negative Sampling.

26 Word2Vec Objective in Negative Sampling (Goldberg and Levy, 2014). Suppose (h_i, e_o) and (h_i, e_n(1)), ..., (h_i, e_n(k)) are given (o ≠ n(j), j = 1, ..., k). Let [D = 1 | h_i, e_j] be the event that the pair (h_i, e_j) came from the original corpus. The model assumes: P(D = 1 | h_i, e_j) = σ(w'_j^T h_i). Thus the likelihood becomes: σ(w'_o^T h_i) Π_{j=1}^{k} [1 - σ(w'_n(j)^T h_i)]. Taking the log leads to the objective in (Mikolov et al., 2013): log σ(w'_o^T h_i) + Σ_{j=1}^{k} log σ(-w'_n(j)^T h_i), with e_n(j) ~ P_n. Note that training h_i given w'_o, w'_n(1), ..., w'_n(k) is a logistic regression.

27 Word2Vec Training Negative Sampling (Rong, 2014). Define the loss as: L = -log σ(w'_o^T h_i) - Σ_{j=1}^{k} log σ(-w'_n(j)^T h_i). Let W_neg = {w'_n(1), ..., w'_n(k)}. Then the derivative: ∂L/∂(w'_j^T h_i) = σ(w'_j^T h_i) - 1 if w'_j = w'_o, and σ(w'_j^T h_i) if w'_j ∈ W_neg; that is, ∂L/∂(w'_j^T h_i) = P(D = 1 | h_i, e_j) - δ_(j=o). Thus the updating equation for W': for j = o, n(1), ..., n(k), w'_j(new) = w'_j(old) - α [P(D = 1 | h_i, e_j) - δ_(j=o)] h_i. Let ∂L/∂h_i = Σ_{j ∈ {o, n(1), ..., n(k)}} (P(D = 1 | h_i, e_j) - δ_(j=o)) w'_j ≡ EH. Then the updating equation for W: w_i(new) = w_i(old) - α EH (∂h_i/∂w_i); for skip-gram, h_i = w_i.
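A minimal negative-sampling sketch for one (input, output) pair plus k noise words, assuming toy parameters and a unigram^(3/4) noise distribution; for simplicity the positive word is excluded from the noise draw.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d, k, alpha = 10, 4, 3, 0.05
rng = np.random.default_rng(5)
W = rng.uniform(-0.5, 0.5, (d, V))              # input vectors w_i
W_out = rng.uniform(-0.5, 0.5, (d, V))          # output vectors w'_j
counts = rng.integers(1, 100, V).astype(float)  # hypothetical word counts #(e_n)
P_n = counts ** 0.75 / (counts ** 0.75).sum()   # noise distribution prop. to count^(3/4)

def neg_sampling_step(i, o):
    h = W[:, i]                                  # hidden vector (skip-gram: h = w_i)
    probs = P_n.copy(); probs[o] = 0.0; probs /= probs.sum()   # exclude the positive word
    neg = rng.choice(V, size=k, replace=False, p=probs)        # e_n(1), ..., e_n(k)
    idx = np.concatenate(([o], neg))
    labels = np.zeros(k + 1); labels[0] = 1.0    # delta_(j=o)
    pred = sigmoid(W_out[:, idx].T @ h)          # P(D = 1 | h, e_j)
    err = pred - labels                          # prediction error
    EH = W_out[:, idx] @ err
    W_out[:, idx] -= alpha * np.outer(h, err)    # update w'_o and the k negative vectors
    W[:, i] -= alpha * EH                        # update the input vector

neg_sampling_step(i=3, o=7)
```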

28 Word2Vec Two Pre-processing Techniques (Mikolov et al., 2013). Frequent words (such as "a", "the", "in") provide less information value than rare words. Let V = {v_1, ..., v_V} be the vocabulary set. Discard each occurrence of word v_i with probability: P(v_i) = 1 - sqrt(t / [#(v_i)/N]), where t = 10^(-5) is a suitable threshold value. Phrases such as "New York Times" or "Toronto Maple Leafs" can be considered as one word. In order to find such phrases, define a score: score(v_i, v_j) = [#(v_i v_j) - δ] / [#(v_i) #(v_j)]. Over 2-4 passes of the training set, calculate the score with decreasing δ; above some threshold value, treat v_i v_j as a single word. A subsampling sketch follows below.
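A minimal sketch of the subsampling rule on a toy corpus; each occurrence of word v is discarded with probability 1 - sqrt(t / (#(v)/N)). The threshold here is exaggerated so the effect is visible on a few words (the paper uses t = 10^-5 for large corpora).

```python
import numpy as np
from collections import Counter

corpus = "the cat sat on the mat the cat likes the mat".split()
N = len(corpus)
freq = {w: c / N for w, c in Counter(corpus).items()}   # #(v)/N
t = 0.1                                                 # exaggerated threshold for a toy corpus

rng = np.random.default_rng(6)
def keep(word):
    p_discard = max(0.0, 1.0 - np.sqrt(t / freq[word]))
    return rng.random() >= p_discard

subsampled = [w for w in corpus if keep(w)]
print(subsampled)   # occurrences of frequent words like "the" are often dropped
```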

29 GloVe

30 GloVe Motivation (Pennington et al., 2014). Let V = {v_1, ..., v_V} be the vocabulary set. Throughout the corpus C, define some statistics: X_ij, the number of times word v_j appears in the context of word v_i; X_i ≡ Σ_k X_ik, the number of times any word appears in the context of v_i; P_ij = X_ij / X_i, the probability that v_j appears in the context of v_i. How can we measure similarity between words, sim(v_i, v_j)?

31 GloVe Motivation (Pennington et al., 2014). Co-occurrence probabilities for "ice" and "steam" with selected context words from a corpus (N = 6 billion). If v_k is related to v_i rather than v_j, then P_ik / P_jk will be larger than 1. If v_k is related (or not related) to both v_i and v_j, then P_ik / P_jk will be close to 1. The ratio P_ik / P_jk is useful for finding out whether v_k is close to v_i (or to v_j). Figure: from (Pennington et al., 2014).

32 GloVe Model Setup (Pennington et al., 2014). With this motivation, the model becomes: P_ik / P_jk = F(w_i, w_j, w'_k), with w_i, w_j, w'_k ∈ R^d. Using two kinds of parameters W, W' can help reduce overfitting and noise and generally improves results (Ciresan et al., 2012). In vector space, knowing w_1, ..., w_V is the same as knowing the differences w_1 - w_i, ..., w_V - w_i, so F can be restricted to: P_ik / P_jk = F(w_i - w_j, w'_k). In order to match dimensions and preserve the linear structure, use dot products: P_ik / P_jk = F[(w_i - w_j)^T w'_k].

33 GloVe Model Setup (Pennington et al., 2014). For any i, j, k, l = 1, ..., V: F[(w_i - w_j)^T w'_k] F[(w_j - w_l)^T w'_k] = F[(w_i - w_l)^T w'_k] = P_ik / P_lk. It is therefore natural to require F to satisfy F(x)F(y) = F(x + y), which implies F = exp(·). Moreover: F[(w_i - w_j)^T w'_k] = exp(w_i^T w'_k) / exp(w_j^T w'_k) = P_ik / P_jk. Thus w_i^T w'_k = log P_ik = log X_ik - log X_i. Since the roles of a word and a context word are exchangeable, the model should also be symmetric in w_i and w'_k.

34 GloVe Model Setup (Pennington et al., 2014). Absorb log X_i into a bias b_i of the input representation and add another bias b'_k for the context representation. Finally, the model becomes: w_i^T w'_k + b_i + b'_k = log X_ik. Now, define a weighted cost function: L = Σ_{i,j=1}^{V} f(X_ij) (w_i^T w'_j + b_i + b'_j - log X_ij)^2. The weight f must satisfy: f(0) = 0, to handle the case X_ij = 0; f must be non-decreasing, so that frequent co-occurrences are emphasized; and f should be relatively small for large values, to avoid overweighting words like "in", "the", and "and".

35 GloVe Training GloVe (Pennington et al., 2014). The suggested weight is: f(x) = (x/x_max)^α if x < x_max, and 1 if x ≥ x_max. x_max is reported to have a weak impact on performance (fixed at x_max = 100); α = 3/4 gives a modest improvement over α = 1. Training uses AdaGrad (Duchi et al., 2011), stochastically sampling the non-zero elements of X. The model generates W and W'; the final word vectors are W + W'.
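A minimal sketch of the GloVe weighted least-squares objective, assuming a hypothetical co-occurrence matrix X and random parameters; it only evaluates the loss with the weighting function f above, not the AdaGrad training loop.

```python
import numpy as np

V, d, x_max, a = 5, 3, 100.0, 0.75
rng = np.random.default_rng(7)
X = rng.integers(0, 50, (V, V)).astype(float)    # hypothetical co-occurrence counts

W = rng.normal(scale=0.1, size=(V, d))           # word vectors w_i
W_ctx = rng.normal(scale=0.1, size=(V, d))       # context vectors w'_j
b, b_ctx = np.zeros(V), np.zeros(V)              # biases b_i, b'_j

def f(x):
    return np.where(x < x_max, (x / x_max) ** a, 1.0)

def glove_loss():
    total = 0.0
    for i in range(V):
        for j in range(V):
            if X[i, j] == 0:                     # f(0) = 0: zero counts contribute nothing
                continue
            resid = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
            total += f(X[i, j]) * resid ** 2
    return total

print(glove_loss())
```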

36 Toy Implementation

37 Toy Implementation Data and Model Descriptions. Movie review data from the NLTK corpus, consisting of plot summaries and critiques. Corpus size N = 1.5 million, vocabulary size V = . Embedding dimension d = 100, window size c = 5, negative sample size k = 5. GloVe trained with 10 epochs. Time elapsed for training (Intel Core i7, 3.60 GHz): CBOW+HS 9.14s, CBOW+NEG 4.53s, SG+HS 12.4s, SG+NEG 12.3s, GloVe 44.2s.

38 Toy Implementation Results: similarity between two vectors.

39 Toy Implementation Results: similarity between two vectors (most frequent words).

40 Toy Implementation Results: top 5 similar words with "villian".

41 Toy Implementation Results: linear relationship (actor + she - actress = ?).

42 Toy Implementation Results: linear relationship (king + she - he = ?).

43 Performances

44 Performances Intrinsic Performances (Pennington et al., 2014). Word analogy task: 19,544 questions. Semantic: Athens is to Greece as Berlin is to (?). Syntactic: dance is to dancing as fly is to (?). Corpus: Gigaword5 + Wikipedia2014. Percentage of correct answers (Sem., Syn., Tot.) for CBOW (d = 300, N = 6B), SG (d = 300, N = 6B) and GloVe (d = 300, N = 6B). Table: from (Pennington et al., 2014).

45 Performances Extrinsic Performances (Pennington et al., 2014). Named entity recognition (NER) with a Conditional Random Field (CRF) model. Input: "Jim bought 300 shares of Acme Corp. in 2006". Output: "[Jim](Person) bought 300 shares of [Acme Corp.](Organization) in ". Entities: person, location, organization, miscellaneous.

46 Performances Extrinsic Performances (Pennington et al., 2014). Trained with the CoNLL-03 training set and 50-dimensional word vectors. F1 score on the validation set and 3 kinds of test sets; columns: Model, Validation, CoNLL-Test, ACE, MUC7; rows: Discrete, CBOW, SG (None, None, None, None), GloVe. Table: from (Pennington et al., 2014).

47 Word Embedding + RNN

48 Word Embedding + RNN How to Add Embedded Vectors to RNN. Recall the RNN model: Input: x_t. Hidden unit: h_t = tanh(b + U_h h_{t-1} + U_i x_t). Output unit: o_t = c + U_o h_t. Predicted probability: p_t = softmax(o_t). Unknown parameters: (U_i, U_o, U_h, b, c).

49 Word Embedding + RNN How to Add Embedded Vectors to RNN. With word embeddings: Input: w_i(t) = W x_t. Hidden unit: h_t = tanh(b + U_h h_{t-1} + U_i w_i(t)). Output unit: o_t = c + U_o h_t. Predicted probability: p_t = softmax(o_t). Unknown parameters: (W, U_i, U_o, U_h, b, c). W is not just an input; it is the initial weight matrix of the word vectors, so the word vectors are fine-tuned for the specific goal. Another derivative is added: for k = 1, ..., V, ∂L/∂w_k = Σ_{t: i(t)=k} (∂L/∂o_t)(∂o_t/∂h_t)(∂h_t/∂w_k). This can be generalized to LSTM and GRU.
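A minimal sketch of plugging pretrained word vectors into a simple RNN, with assumed toy shapes; W is initialized from the embeddings and would then be fine-tuned along with the other parameters.

```python
import numpy as np

V, d, H = 10, 4, 6                            # assumed vocab size, embedding dim, hidden size
rng = np.random.default_rng(8)
pretrained = rng.uniform(-0.5, 0.5, (d, V))   # stand-in for word2vec / GloVe vectors

W = pretrained.copy()                         # trainable, initialized from the embeddings
U_i = rng.normal(scale=0.1, size=(H, d))
U_h = rng.normal(scale=0.1, size=(H, H))
U_o = rng.normal(scale=0.1, size=(V, H))
b, c = np.zeros(H), np.zeros(V)

def forward(word_indices):
    h = np.zeros(H)
    probs = []
    for idx in word_indices:
        x = W[:, idx]                           # w_i(t) = W x_t (x_t one-hot)
        h = np.tanh(b + U_h @ h + U_i @ x)      # hidden unit
        o = c + U_o @ h                         # output unit
        p = np.exp(o - o.max()); p /= p.sum()   # softmax
        probs.append(p)
    return probs

print(forward([1, 4, 2])[-1].sum())             # 1.0: a distribution over the next word
```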

50 Word Embedding + RNN Word-rnn (Eidnes, 2015). Goal: generating clickbait headlines. Trained on 2M clickbait headlines scraped from Buzzfeed, Gawker, Jezebel, Huffington Post and Upworthy. RNN model using GloVe word vectors (N = 6B, d = 200) as initial weights. 3-layer LSTM model with T = .

51 Word Embedding + RNN Word-rnn (Eidnes, 2015). First 8 completions of "Barack Obama Says": Barack Obama Says It's Wrong To Talk About Iraq; Barack Obama Says He's Like A Single Mother And Over The Top; Barack Obama Says He Did 48 Things Over; Barack Obama Says About Ohio Law; Barack Obama Says He Is Wrong; Barack Obama Says He Will Get The American Idol; Barack Obama Says Himself Are Doing Well Around The World; Barack Obama Says As He Leaves Politics With His Wife. More examples are on the website given in the references. Most of the generated sentences are grammatically correct and make sense.

52 Word Embedding + RNN Word-rnn (Eidnes, 2015). The model seems to understand gender and political context: "Mary J. Williams On Coming Out As A Woman"; "Romney Camp: I Think You Are A Bad President". Updating W for only 2 layers works best. Figure: from (Eidnes, 2015).

53 Conclusion

54 Conclusion Summary. Embedding discrete words into R^d yields interesting results: similar words have vectors with high cosine similarity, and linear relationships hold (king + she - he = ?). Embedded vectors can be used as inputs or as initial weights of a deep neural network.

55 References

56 References Key References. Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

57 References Key References. Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738. Eidnes, L. (2015). Auto-Generating Clickbait With Recurrent Neural Networks. Lars Eidnes' blog [Accessed 8 May 2018].
