Sparse Word Embeddings Using ℓ1 Regularized Online Learning
1 Sparse Word Embeddings Using ℓ1 Regularized Online Learning
Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng
July 14, 2016
ofey.sunfei@gmail.com, {guojiafeng, lanyanyan, junxu, cxq}@ict.ac.cn
CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences
2 Distributed Word Representation
Applications: Machine Translation [Kalchbrenner and Blunsom, 2013], Sentiment Analysis [Maas et al., 2011], Language Modeling [Bengio et al., 2003], POS Tagging [Collobert et al., 2011], Parsing [Socher et al., 2011], Word-Sense Disambiguation [Collobert et al., 2011].
Distributed word representations are widely used across the NLP community.
3 [Architecture diagrams of neural embedding models: NPLM, LBL, C&W, Huang's global-context model, and Word2Vec's CBOW and Skip-gram.]
State of the art: CBOW and SG.
4–8 Dense Representation and Interpretability
Example¹:
man      [ ,..., ,..., ]
woman    [ ,..., ,..., ]
dog      [ ,..., ,..., ]
computer [ ,..., ,..., ]
- Which dimension represents the gender of man and woman?
- What sort of value indicates male or female?
- The gender dimension(s) would be active in all word vectors, including irrelevant words like computer.
- Dense vectors are difficult to interpret and uneconomical to store.
¹ Vectors from GoogleNews-vectors-negative300.bin.
9–10 Sparse Word Representation
Non-Negative Sparse Embedding (NNSE) [Murphy et al., 2012]:
$$\arg\min_{A \in \mathbb{R}^{m \times k},\, D \in \mathbb{R}^{k \times n}} \sum_{i=1}^{m} \lVert X_i - A_i D \rVert^2 + \lambda \lVert A_i \rVert_1$$
$$\text{s.t. } A_{i,j} \ge 0,\ 1 \le i \le m,\ 1 \le j \le k; \qquad D_i D_i^{\top} \le 1,\ 1 \le i \le k$$
Sparse Coding of Word2Vec vectors [Faruqui et al., 2015]:
[Diagram: initial dense vectors X are factorized via sparse coding into sparse overcomplete vectors A and a dictionary D; a further projection yields sparse, binary overcomplete vectors B.]
- NNSE introduces sparse and non-negative constraints into matrix factorization.
- Sparse coding converts dense vectors in a post-processing step.
- They are difficult to train on large-scale data because of heavy memory usage!
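To make the NNSE-style objective concrete, here is a minimal NumPy sketch of alternating projected-gradient updates for the non-negative sparse coding problem above. The names (X, A, D, lam, lr) and the simple alternating scheme are illustrative assumptions, not the solvers used in the cited papers.

```python
import numpy as np

def nn_sparse_coding(X, k, lam=0.1, lr=0.01, iters=200, seed=0):
    """Alternating projected-gradient sketch of the NNSE objective:
    min ||X - A D||^2 + lam * ||A||_1  s.t.  A >= 0, rows of D have norm <= 1."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    A = np.abs(rng.normal(scale=0.01, size=(m, k)))   # non-negative codes
    D = rng.normal(scale=0.01, size=(k, n))           # dictionary
    for _ in range(iters):
        R = X - A @ D                                 # residual
        # Gradient step on A, then soft-threshold (l1) and clip to >= 0
        A += lr * (R @ D.T)
        A = np.maximum(A - lr * lam, 0.0)
        # Gradient step on D, then renormalize rows to norm <= 1
        D += lr * (A.T @ R)
        norms = np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1.0)
        D /= norms
    return A, D
```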
11–14 Our Motivation
We want embeddings that offer good performance, are fast, scale to large corpora, and are interpretable.
- Word2Vec: strong performance, fast, large scale.
- Sparse representations: interpretable.
- Sparse CBOW: directly apply the sparse constraint to CBOW to combine both.
15 CBOW
[Diagram: CBOW predicts the center word $w_i$ from the averaged context vectors $\vec{c}_{i-2}, \vec{c}_{i-1}, \vec{c}_{i+1}, \vec{c}_{i+2}$.]
$$\mathcal{L}_{\text{cbow}} = \sum_{i=1}^{N} \log p(w_i \mid h_i), \qquad p(w_i \mid h_i) = \frac{\exp(\vec{w}_i^{\top} \vec{h}_i)}{\sum_{w \in W} \exp(\vec{w}^{\top} \vec{h}_i)}, \qquad \vec{h}_i = \frac{1}{2l} \sum_{j=i-l,\, j \neq i}^{i+l} \vec{c}_j$$
With negative sampling:
$$\mathcal{L}^{\text{ns}}_{\text{cbow}} = \sum_{i=1}^{N} \Big( \log \sigma(\vec{w}_i^{\top} \vec{h}_i) + k\, \mathbb{E}_{w' \sim P}\big[\log \sigma(-\vec{w}'^{\top} \vec{h}_i)\big] \Big)$$
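A minimal NumPy sketch of one negative-sampling CBOW step for a single position may help make the objective concrete; the array names (W_in, W_out), learning rate, and the way noise words are passed in are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_ns_step(W_in, W_out, context_ids, center_id, noise_ids, lr=0.05):
    """One SGD step on the negative-sampling CBOW objective for one position.
    W_in: context vectors (|V| x d), W_out: word (output) vectors (|V| x d)."""
    h = W_in[context_ids].mean(axis=0)              # h_i = average of context vectors
    grad_h = np.zeros_like(h)
    # Positive example (label 1) and k negative samples (label 0)
    for wid, label in [(center_id, 1.0)] + [(nid, 0.0) for nid in noise_ids]:
        score = sigmoid(W_out[wid] @ h)
        g = label - score                           # gradient of the log-likelihood term
        grad_h += g * W_out[wid]
        W_out[wid] += lr * g * h                    # update output word vector
    # Propagate the accumulated gradient back to each context vector
    W_in[context_ids] += lr * grad_h / len(context_ids)
```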
16–19 Sparse CBOW
$$\mathcal{L}^{\text{ns}}_{\text{s-cbow}} = \mathcal{L}^{\text{ns}}_{\text{cbow}} - \lambda \sum_{w \in W} \lVert \vec{w} \rVert_1$$
- Training is online with SGD, but naive SGD on the ℓ1-penalized objective failed (it does not produce truly sparse vectors).
- Solution: Regularized Dual Averaging (RDA) [Xiao, 2009], which truncates coordinates using the online average of subgradients.
20 RDA Algorithm for Sparse CBOW
1: procedure SparseCBOW(C)
2:   Initialize $\vec{w}$ for all $w \in W$, $\vec{c}$ for all $c \in C$, and $\bar{g}^0_{\vec{w}} = \vec{0}$ for all $w \in W$
3:   for $i = 1, 2, 3, \ldots$ do
4:     $t \leftarrow$ update time of word $w_i$
5:     $\vec{h}_i = \frac{1}{2l} \sum_{j=i-l,\, j \neq i}^{i+l} \vec{c}_j$
6:     $g^t_{\vec{w}_i} = \big[\mathbb{1}_{h_i}(w_i) - \sigma(\vec{w}_i^{t\top} \vec{h}_i)\big]\, \vec{h}_i$
7:     $\bar{g}^t_{\vec{w}_i} = \frac{t-1}{t}\, \bar{g}^{t-1}_{\vec{w}_i} + \frac{1}{t}\, g^t_{\vec{w}_i}$   ▷ keep track of the online average subgradient
8:     Update $\vec{w}_i$ element-wise, for $j = 1, 2, \ldots, d$:   ▷ truncation
         $w^{t+1}_{ij} = 0$ if $\big|\bar{g}^t_{w_{ij}}\big| \le \frac{\lambda}{\#(w_i)}$, and
         $w^{t+1}_{ij} = -\eta\sqrt{t}\,\Big(\bar{g}^t_{w_{ij}} - \frac{\lambda}{\#(w_i)}\,\operatorname{sgn}\big(\bar{g}^t_{w_{ij}}\big)\Big)$ otherwise
9:     for $k = -l, \ldots, -1, 1, \ldots, l$ do
10:      $\vec{c}_{i+k} := \vec{c}_{i+k} + \alpha\,\frac{1}{2l}\big[\mathbb{1}_{h_i}(w_i) - \sigma(\vec{w}_i^{t\top} \vec{h}_i)\big]\, \vec{w}^t_i$
11:    end for
12:  end for
13: end procedure
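The following NumPy sketch illustrates the RDA-style truncation for a single word vector, assuming the subgradient from step 6 has already been computed. The variable names (g_bar, g_t, count_wi, lam, eta) and the exact step-size scaling are assumptions for illustration, not the released implementation.

```python
import numpy as np

def rda_update(g_bar, g_t, t, count_wi, lam=1e-5, eta=0.05):
    """One RDA step for one word vector.
    g_bar: running average subgradient (d,), g_t: current subgradient (d,),
    t: update count for this word, count_wi: corpus frequency #(w_i)."""
    # Keep track of the online average subgradient (step 7 of the algorithm)
    g_bar = (t - 1) / t * g_bar + g_t / t
    thresh = lam / count_wi
    # Element-wise truncation (step 8): zero out coordinates whose averaged
    # gradient is small, otherwise take a scaled, l1-shifted step.
    w_new = np.where(
        np.abs(g_bar) <= thresh,
        0.0,
        -(eta * np.sqrt(t)) * (g_bar - thresh * np.sign(g_bar)),
    )
    return w_new, g_bar
```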
21 Evaluation
22 Baselines & Tasks
Baselines:
- Dense representation models: GloVe [Pennington et al., 2014]; CBOW and SG [Mikolov et al., 2013]
- Sparse representation models: Sparse Coding (SC) [Faruqui et al., 2015]; Positive Pointwise Mutual Information (PPMI) [Bullinaria and Levy, 2007]; NNSE [Murphy et al., 2012]
Tasks:
- Interpretability: Word Intrusion
- Expressive Power: Word Analogy, Word Similarity
23 Experimental Settings
Corpus: Wikipedia 2010 (1B words)
Parameter settings:
  window               
  negative             
  iteration            
  λ                    grid search
  learning rate        0.05
  noise distribution   #(w)^0.75
Baselines:
  GloVe, CBOW, SG   same settings as the released tools
  SC                embeddings of CBOW as input
  NNSE              PPMI matrix, 40,000 words, SPAMS
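For the dense CBOW/SG baselines, a setup along these lines could reproduce the "released tools" configuration using gensim 4.x; gensim is only one possible implementation, and the window, negative, and epoch values below are placeholders since the slide does not preserve them.

```python
from gensim.models import Word2Vec

# Hypothetical baseline training run; window/negative/epochs are placeholders,
# learning rate 0.05 and the #(w)^0.75 noise distribution follow the slide.
model = Word2Vec(
    corpus_file="wiki2010.txt",  # one sentence per line, whitespace-tokenized
    vector_size=300,             # 300-dimensional vectors as in Tables 1 and 2
    sg=0,                        # CBOW (sg=1 for Skip-gram)
    window=5,                    # placeholder
    negative=5,                  # placeholder
    ns_exponent=0.75,            # noise distribution proportional to #(w)^0.75
    alpha=0.05,                  # initial learning rate
    epochs=5,                    # placeholder
    workers=8,
)
model.wv.save_word2vec_format("cbow_300.txt")
```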
24 Interpretability
25 Word Intrusion [Chang et al., 2009]
[Diagram: the vocabulary is sorted by the value of dimension i (e.g. markov 0.27, ..., bayesian 0.23, ...); the top 5 words (poisson, parametric, markov, bayesian, stochastic) are mixed with the intruder jodel, and an annotator must pick the intruder out.]
1. Sort dimension i in descending order.
2. Build a set: the top 5 words of dimension i plus 1 intruder word (from the bottom 50% of dimension i and the top 10% of some other dimension j, j ≠ i).
3. Pick out the intruder word.
The more interpretable a dimension is, the easier the intruder is to pick out.
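A small NumPy sketch of step 2, building an intrusion set for one dimension from an embedding matrix, may make the construction concrete; the function name and the random choice of the contrasting dimension j are assumptions.

```python
import numpy as np

def build_intrusion_set(emb, vocab, dim, k=5, rng=None):
    """emb: (|V|, d) embedding matrix; vocab: list of |V| words.
    Returns (top-k words of `dim`, one intruder word)."""
    rng = rng or np.random.default_rng()
    d = emb.shape[1]
    order = np.argsort(-emb[:, dim])             # sort dimension `dim` descending
    top_k = order[:k]
    bottom_half = set(order[len(order) // 2:])   # bottom 50% of dimension `dim`
    # Pick another dimension j != dim and take its top 10% as intruder candidates
    j = rng.choice([x for x in range(d) if x != dim])
    top10_j = np.argsort(-emb[:, j])[: max(1, len(vocab) // 10)]
    candidates = [w for w in top10_j if w in bottom_half]
    intruder = rng.choice(candidates)
    return [vocab[i] for i in top_k], vocab[intruder]
```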
26 Evaluation Metric
Traditional metric: human assignment; subjective and costly. We instead define:
$$\mathrm{IntraDist}_i = \sum_{w_j \in \mathrm{top}_k(i)} \; \sum_{\substack{w_k \in \mathrm{top}_k(i) \\ w_k \neq w_j}} \frac{\mathrm{dist}(w_j, w_k)}{k(k-1)}$$
$$\mathrm{InterDist}_i = \sum_{w_j \in \mathrm{top}_k(i)} \frac{\mathrm{dist}(w_j, w_{b_i})}{k}$$
$$\mathrm{DistRatio} = \frac{1}{d} \sum_{i=1}^{d} \frac{\mathrm{InterDist}_i}{\mathrm{IntraDist}_i}$$
- IntraDist_i: average distance between the top k words of dimension i.
- InterDist_i: average distance between the top k words and the intruder (bottom) word w_{b_i}.
- Intuition: the intruder word should be dissimilar to the top words, while the top words should be similar to each other.
- In this work, Euclidean distance is used as dist(·, ·).
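Here is a short NumPy sketch of these three quantities under the stated choice of Euclidean distance; the function name and the way the top-k and intruder indices are passed in are assumptions.

```python
import numpy as np

def dist_ratio(emb, top_k_ids, intruder_ids):
    """emb: (|V|, d) embeddings; top_k_ids[i]: indices of the top-k words of dim i;
    intruder_ids[i]: index of the intruder word chosen for dim i."""
    ratios = []
    for tops, b in zip(top_k_ids, intruder_ids):
        vecs = emb[tops]                                   # (k, d)
        k = len(tops)
        pair = np.linalg.norm(vecs[:, None] - vecs[None, :], axis=-1)
        intra = pair.sum() / (k * (k - 1))                 # average pairwise distance
        inter = np.linalg.norm(vecs - emb[b], axis=-1).mean()
        ratios.append(inter / intra)
    return float(np.mean(ratios))                          # DistRatio
```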
27 Word Intrusion Results
Table 1: 300 dimensions, 10 runs.
Model         Sparsity   DistRatio
GloVe         0%         1.07
CBOW          0%         1.09
SG            0%         1.12
NNSE (PPMI)   89.15%     1.55
SC (CBOW)     88.34%     1.24
Sparse CBOW   90.06%     1.39
- Sparse representations are more interpretable than dense ones.
- Sparse CBOW vs. SC (CBOW): the separate sparse coding step loses information.
- A non-negative constraint might also be a good choice.
28 Case Study
Model         Top 5 words of a dimension                               Interpretation
CBOW          beat, finish, wedding, prize, read
CBOW          rainfall, footballer, breakfast, weekdays, angeles
CBOW          landfall, interview, asked, apology, dinner
CBOW          becomes, died, feels, resigned, strained
CBOW          best, safest, iucn, capita, tallest
Sparse CBOW   poisson, parametric, markov, bayesian, stochastic        statistical learning
Sparse CBOW   ntfs, gzip, myfile, filenames, subdirectories            file system
Sparse CBOW   hugely, enormously, immensely, wildly, tremendously      adverbs of degree
Sparse CBOW   earthquake, quake, uprooted, levees, spectacularly       disasters
Sparse CBOW   bosons, accretion, higgs, neutrinos, quarks              particles
The dimensions of Sparse CBOW reveal clear and consistent semantic meanings.
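The per-dimension word lists in this case study can be produced with a few lines of NumPy; the helper below is an illustrative assumption of how such lists might be extracted, not the authors' script.

```python
import numpy as np

def top_words_per_dim(emb, vocab, k=5):
    """Return the k words with the largest value on each dimension of emb (|V| x d)."""
    order = np.argsort(-emb, axis=0)[:k]          # (k, d) indices, descending per column
    return [[vocab[i] for i in order[:, dim]] for dim in range(emb.shape[1])]
```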
29 Expressive Power
30 Expressive Power
Word Analogy:
- Testset: Google [Mikolov et al., 2013]
- Examples: Beijing : China :: Paris : ?   big : bigger :: deep : ?
- Solution: 3CosMul [Levy and Goldberg, 2014]
- Metric: accuracy (%)
Word Similarity:
- Testsets: WordSim-353 (WS-353) [Finkelstein et al., 2002], SimLex-999 (SL-999) [Hill et al., 2015], Rare Word (RW) [Luong et al., 2013]
- Example: (tiger, cat, 7.35)
- Solution: cosine similarity
- Metric: Spearman rank correlation
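To illustrate the analogy solution, here is a small NumPy sketch of 3CosMul over row-normalized embeddings; the variable names and the ε constant follow the usual convention from Levy and Goldberg (2014) but are written here as assumptions.

```python
import numpy as np

def three_cos_mul(emb, vocab, a, a_star, b, eps=1e-3):
    """Solve a : a* :: b : ? with 3CosMul. Rows of emb must be unit-normalized."""
    idx = {w: i for i, w in enumerate(vocab)}
    # Shift cosines from [-1, 1] to [0, 1] so the ratio is well behaved
    cos = lambda w: (emb @ emb[idx[w]] + 1.0) / 2.0
    scores = cos(a_star) * cos(b) / (cos(a) + eps)
    for w in (a, a_star, b):                 # exclude the query words themselves
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# Hypothetical usage: three_cos_mul(E, words, "big", "bigger", "deep")
```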
31 Results
Table 2: Precision (%) for analogy and Spearman correlation for similarity.
Model         Dim   Sparsity   Sem   Syn   Total   WS-353   SL-999   RW
GloVe         300   0%
CBOW          300   0%
SG            300   0%
PPMI(W-C)           %
PPMI(W-C)           %
NNSE (PPMI)²        %
SC (CBOW)           %
SC (CBOW)           %
Sparse CBOW         %
- Sparse CBOW is a competitive model that uses much less memory.
- It reaches similar analogy performance if its sparsity level is reduced below 85%.
² The input matrix of NNSE is the PPMI representation in the fourth row.
32 Summary
- A sparse word representation model.
- A new evaluation metric for the word intrusion task.
- Improved word vector interpretability.
- Similar performance with less memory usage.
33 Thanks! Q & A
34 References I
Bullinaria, J. A. and Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3).
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., and Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In NIPS. Curran Associates, Inc.
Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015). Sparse overcomplete word vector representations. In Proceedings of ACL.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Trans. Inf. Syst., 20(1).
35 References II
Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.
Levy, O. and Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL.
Luong, M.-T., Socher, R., and Manning, C. D. (2013). Better word representations with recursive neural networks for morphology. In Proceedings of CoNLL.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of ICLR.
Murphy, B., Talukdar, P., and Mitchell, T. (2012). Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING.
36 References III
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of EMNLP.
Xiao, L. (2009). Dual averaging method for regularized stochastic learning and online optimization. In Proceedings of NIPS.