Distributional Semantics and Word Embeddings. Chase Geigle
|
|
- Louisa Carr
- 5 years ago
- Views:
Transcription
1 Distributional Semantics and Word Embeddings Chase Geigle
2 What is a word? dog 2
3 What is a word? dog canine 2
4 What is a word? dog canine 3 2
5 What is a word? dog 3 canine 399,999 2
6 What is a word? dog canine 399,999 2
7 What is a word? dog canine 399,
8 What is a word? dog canine 399, Sparsity! 2
9 The Scourge of Sparsity In the real world, sparsity hurts us: Low vocabulary coverage in training high OOV rate in applications (poor generalization) one-hot representations huge parameter counts information about one word completely unutilized for another! Thus, a good representation must: reduce parameter space, improve generalization, and somehow transfer or share knowledge between words 3
10 The Distributional Hypothesis You shall know a word by the company it keeps. (J. R. Firth, 1957) Words with high similarity occur in the same contexts as one another. 4
11 The Distributional Hypothesis You shall know a word by the company it keeps. (J. R. Firth, 1957) Words with high similarity occur in the same contexts as one another. A bit of foreshadowing: A word ought to be able to predict its context (word2vec Skip-Gram) A context ought to be able to predict its missing word (word2vec CBOW) 4
12 Brown Clustering
13 Language Models A language model is a 5
14 Language Models A language model is a probability distribution over 5
15 Language Models A language model is a probability distribution over word sequences. 5
16 Language Models A language model is a probability distribution over word sequences. Unigram language model p(w θ): p(w θ) = N p(w i θ) i=1 5
17 Language Models A language model is a probability distribution over word sequences. Unigram language model p(w θ): p(w θ) = N p(w i θ) i=1 Bigram language model p(w i w i 1, θ): p(w θ) = N p(w i w i 1, θ) i=1 5
18 Language Models A language model is a probability distribution over word sequences. Unigram language model p(w θ): p(w θ) = N p(w i θ) i=1 Bigram language model p(w i w i 1, θ): p(w θ) = N p(w i w i 1, θ) i=1 n-gram language model: p(w n w n 1 1, θ) p(w θ) = N i=1 p(w i w i 1 i n+1, θ) 5
19 A Class-Based Language Model 1 Main Idea: cluster words into a fixed number of clusters C and use their cluster assignments as their identity instead (reducing sparsity) If π : V C is a mapping function from a word type to a cluster (or class), we want to find π = arg max p(w π) where N p(w π) = p(c i c i 1 )p(w i c i ). i=1 1 Peter F. Brown et al. Class-based N-gram Models of Natural Language. In: Comput. Linguist (Dec. 1992), pp
20 Finding the best partition π = arg max π One can derive 2 L(π) to be w P(W π) = arg max log P(W π) π L(π) = n w log n w + n ci,c j log n ci n cj ci }{{}},cj {{} (nearly) unigram entropy (nearly) mutual information (fixed w.r.t. π) (varies with π) nc i,c j 2 Sven Martin, Jörg Liermann, and Hermann Ney. Algorithms for Bigram and Trigram Word Clustering. In: Speech Commun (Apr. 1998), pp
21 Finding the best partition L(π) = w n w log n w + n ci,c j log n c i,c j n c }{{} i,c ci n cj j }{{} (nearly) unigram entropy (nearly) mutual information (fixed w.r.t. π) (varies with π) What does maximizing this mean? Recall MI(c, c ) = c i,c j p(c i, c j ) log p(c i, c j ) p(c i )p(c j ) which can be shown to be MI(c, c 1 ) = n ci,c }{{} N j log c i,c j (constant) n ci,cj n ci n cj + log N }{{} (constant) and thus maximizing MI of adjacent classes selects the best π. 8
22 Finding the best partition (cont d) π = arg max π n w log n w w }{{} (nearly) unigram entropy (fixed w.r.t. π) + c i,c j n ci,c j log n c i,c j n ci n cj }{{} (nearly) mutual information (varies with π) Direct maximization is intractable! Thus, agglomerative (bottom-up) clustering is used as a greedy heuristic. The best merge is determined by the lowest loss in average mutual information. 9
23 Agglomerative Clustering 10
24 Do they work? (Yes.) Named entity recognition: Scott Miller, Jethran Guinness, and Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. In: HLT-NAACL. 2004, pp Lev Ratinov and Dan Roth. Design Challenges and Misconceptions in Named Entity Recognition. In: Proc. CoNLL. 2009, pp Dependency parsing: Terry Koo, Xavier Carreras, and Michael Collins. Simple Semi-supervised Dependency Parsing. In: Proc. ACL: HLT. June 2008, pp Jun Suzuki and Hideki Isozaki. Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data. In: Proc. ACL: HLT. June 2008, pp Constituency parsing: Marie Candito and Benoı t Crabbé. Improving Generative Statistical Parsing with Semi-supervised Word Clustering. In: IWPT. 2009, pp Muhua Zhu et al. Fast and Accurate Shift-Reduce Constituent Parsing. In: ACL. Aug. 2013, pp
25 Vector Spaces for Word Representation
26 Toward Vector Spaces Brown clusters are nice, but limiting: Arbitrary cut-off point two words may be assigned different classes arbitrarily because of our chosen cutoff Definition of similarity is limited to only bigrams of classes 3 Completely misses lots of regularity that s present in language 3 There are adaptations of Brown clustering that move past this, but what we ll see today is still even better. 12
27 What do I mean by regularity? woman is to sister as man is to 13
28 What do I mean by regularity? woman is to sister as man is to brother 13
29 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to 13
30 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow 13
31 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to 13
32 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen 13
33 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen fell is to fallen as ate is to 13
34 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen fell is to fallen as ate is to eaten 13
35 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen fell is to fallen as ate is to eaten running is to ran as crying is to 13
36 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen fell is to fallen as ate is to eaten running is to ran as crying is to cried 13
37 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen fell is to fallen as ate is to eaten running is to ran as crying is to cried The differences between each pair of words are similar. Can our word representations capture this? (demo) 13
38 14
39 15
40 Neural Word Embeddings
41 Associate a low-dimensional, dense vector w with each word w V so that similar words (in a distributional sense) share a similar vector representation. 15
42 If I only had a vector: The Analogy Problem To solve analogy problems of the form w a is to w b as w c is to what?, we can simply compute a query vector q = w b w a + w c and find the most similar word vector v W to q. If we normalize q to unit-length ˆq = q q and assume each vector in W is also unit-length, this reduces to computing arg max v ˆq v W and returning the associated word v. 16
43 The CBOW and Skip-Gram Models 4 (word2vec) INPUT PROJECTION OUTPUT INPUT PROJECTION OUTPUT w(t-2) w(t-2) w(t-1) SUM w(t-1) w(t) w(t) w(t+1) w(t+1) w(t+2) w(t+2) CBOW Skip-gram 4 Tomas Mikolov et al. Efficient estimation of word representations in vector space. In: ICLR Workshop (2013). 17
44 Skip-Gram, Mathematically Training corpus w 1, w 2,..., w N (with N typically in the billions) from a fixed vocabulary V. Goal is to maximize the average log probability: 1 N N i=1 L k L;k 0 log p(w i+k w i ) Associate with each w V an input vector w R d and an output vector w R d. Model context probabilities as p(c w) = exp(w c) c V exp(w c ). The problem? V is huge! log p(c w) takes time O( V ) to compute! 18
45 Negative Sampling 5 Given a pair (w, c), can we determine if this came from our corpus or not? Model probabilistically as p(d = 1 w, c) = σ(w c) = exp( w c). Goal: maximize p(d = 1 w, c) for pairs (w, c) that occur in the data. Also maximize p(d = 0 w, c N ) for (w, c N ) pairs where c N is sampled randomly from the empirical unigram distribution. 5 Tomas Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality. In: Advances in Neural Information Processing Systems , pp
46 Skip-Gram Negative Sampling Objective Locally, l = log σ(w c) + k E cn p D (c N ) [log σ( w c N )] and thus globally L = (n w,c ) ( log σ(w c) + k E cn P n(c N ) [log σ( w c N )] ) w V c V k: number of negative samples to take (hyperparameter) n w,c : number of times (w, c) was seen in the data E cn P n(c N ) indicates an expectation taken with respect to the noise distribution P n (c N ). 20
47 Implementation Details (Read: Useful Heuristics) 1. What is the noise distribution? P n (c N ) = (n c N /N) 3/4 (which is the empirical unigram distribution raised to the 3/4 power). 2. Frequent words can dominate the loss. Throw away word w i in the training data according to t P(w i ) = 1 where t is some threshold (like 10 5 ). Z n wi Both are essentially unexplained by Mikolov et al. 21
48 What does this actually do? It turns out this is nothing that new! Levy and Goldberg 6 show that SGNS is implicitly factorizing the matrix M i,j = w i c j = PMI(w i, c j ) log k = log p(w i, c j ) p(w i )p(c j ) log k using an objective that weighs deviations in more frequent (w, c) pairs more strongly than less frequent ones. Thus 6 Omer Levy and Yoav Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In: Advances in Neural Information Processing Systems , pp
49 Matrix Factorization Methods for Word Embeddings
50 Why not just SVD? Since SGNS is just factorizing a shifted version of the PMI matrix, why not just perform a rank-k SVD on the PMI matrix directly? Assignment 6! With a little TLC, this method can actually work and does surprisingly well. 23
51 GloVe: Global Vectors for Word Representation 7 How can we capture words related to ice, but not to steam? Prob. or Ratio w k = solid w k = gas w k = water w k = fashion P(w k ice) P(w k steam) P(w k ice) P(w k steam) Probability ratios are most informative: solid is related to ice but not steam gas is related to steam but not ice water and fashion do not discriminate between ice or steam (ratios close to 1) 7 Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In: EMNLP. 2014, pp
52 Building a model: intuition We would like for the vectors w i, w j, and w k to be able to capture the information present in the probability ratio. If we set then we can easily see that w i w k = log P(w k w i ) (w i w j ) w k = log P(w k w i ) log P(w k w j ) = log P(w k w i ) P(w k w j ), ensuring that the information present in the co-occurrence probability ratio is expressed in the vector space. 25
53 GloVe Objective w i w j + b i + b j = log X ij Again: two sets of vectors: target vectors w and context vectors w. X ij = n wi,w j is the number of times w j appears in the context of w i. Problem: all co-occurrences are weighted equally J = V f (X ij )(w i w j + b i + b j log X ij ) 2 i,j=1 Final objective is just weighted least-squares. f (X ij ) serves as a dampener, lessening the weight of the rare co-occurrences. ( ) α x x f (x) = max if x < xmax 1 otherwise where α s default is (you guessed it) 3/4, and x max s default is
54 Relation to word2vec GloVe Objective: V J = f (X ij )(w i i,j=1 w j + b i + b j log X ij ) 2 word2vec (Skip-Gram) Objective (after rewriting): V V J = P(w j w i ) log Q(w j w i ) i=1 X i j=1 where X i = k X ik and P and Q are the empirical co-occurrence and model co-occurrence distributions, respectively. Weighted cross-entropy error! Authors show that replacing cross-entropy with least-squares nearly re-derives GloVe: Ĵ = i,j X i (w i w j log X ij ) 2 27
55 Model Complexity word2vec: O( C ), linear in corpus size GloVe naïve estimate: O( V 2 ), square of vocab size But it actually only depends on the number of non-zero entries in X. If co-occurrences are modeled via a power law, then we have X ij = k (r ij ) α. Modeling C under this assumption, the authors eventually arrive at { O( C ) if α < 1 X = O( C 1/α ) otherwise where the corpora studied in the paper were well modeled with α = 1.25, leading to O( C 0.8 ). In practice, it s faster than word2vec. 28
56 Accuracy vs Vector Size Accuracy [%] Semantic Syntactic Overall Vector Dimension 29
57 Accuracy vs Window Size Accuracy [%] Semantic Syntactic Overall Accuracy [%] Semantic Syntactic Overall Window Size Window Size Left: Symmetric window, Right: Asymmetric window 30
58 Accuracy vs Corpus Choice 85 Semantic Syntactic Overall 80 Accuracy [%] Wiki2010 1B tokens Wiki B tokens Gigaword5 4.3B tokens Gigaword5 + Wiki2014 6B tokens Common Crawl 42B tokens 31
59 Training Time 72 Training Time (hrs) Accuracy [%] GloVe Skip-Gram Iterations (GloVe) Negative Samples (Skip-Gram) 32
60 This evaluation is not quite fair! 32
61 But wait (Referring to word2vec) the code is designed only for a single training epoch specifies a learning schedule specific to one pass through the data, making a modification for multiple passes a non-trivial task. 33
62 But wait (Referring to word2vec) the code is designed only for a single training epoch specifies a learning schedule specific to one pass through the data, making a modification for multiple passes a non-trivial task. This is false, and a bad excuse. How could you modify word2vec to support multiple epochs with zero effort? 33
63 But wait we choose to use the sum W + W as our word vectors. 34
64 But wait we choose to use the sum W + W as our word vectors. word2vec also learns W! Why not do the same for both? 34
65 Embedding performance is very much a function of your hyperparameters! 34
66 Investigating Hyperparameters 8 Levy, Goldberg, and Dagan perform a systematic evaluation of word2vec, SVD, and GloVe where hyperparameters are controlled for and optimized across different methods for a variety of tasks. Interesting results: And: MSR s analogy dataset is the only case where SGNS and GloVe substantially outperform PPMI and SVD. SGNS outperforms GloVe in every task. 8 Omer Levy, Yoav Goldberg, and Ido Dagan. Improving Distributional Similarity with Lessons Learned from Word Embeddings. In: Transactions of the Association for Computational Linguistics 3 (2015), pp
67 But wait (mark II) Both GloVe and word2vec use context window weighting: word2vec: samples a window size l [1, L] for each token before extracting counts GloVe: weighs contexts by their distance from the target word using the harmonic function 1 d What is the impact of using the different context window strategies between methods? Levy, Goldberg, and Dagan did not investigate this, instead using word2vec s method across all of their evaluations. 36
68 Evaluating things properly is non-obvious and often very difficult! 36
69 What You Should Know
70 What You Should Know What is the sparsity problem in NLP? What is the distributional hypothesis? How does Brown clustering work? What is it trying to maximize? What is agglomerative clustering? What is a word embedding? What is the difference between the CBOW and Skip-Gram models? What is negative sampling and why is it useful? What is the learning objective for SGNS? What is the matrix that SGNS is implicitly factorizing? What is the learning objective for GloVe? What are some examples of hyperparameters for word embedding methods? 37
71 Implementations word2vec: The original implementation: Yoav Goldberg s word2vecf modification (multiple epochs, arbitrary context features): Python implementation in gensim: GloVe: The original implementation: MeTA: R implementation text2vec: 38
72 Thanks! 38
GloVe: Global Vectors for Word Representation 1
GloVe: Global Vectors for Word Representation 1 J. Pennington, R. Socher, C.D. Manning M. Korniyenko, S. Samson Deep Learning for NLP, 13 Jun 2017 1 https://nlp.stanford.edu/projects/glove/ Outline Background
More informationDeep Learning Basics Lecture 10: Neural Language Models. Princeton University COS 495 Instructor: Yingyu Liang
Deep Learning Basics Lecture 10: Neural Language Models Princeton University COS 495 Instructor: Yingyu Liang Natural language Processing (NLP) The processing of the human languages by computers One of
More informationDeep Learning. Ali Ghodsi. University of Waterloo
University of Waterloo Language Models A language model computes a probability for a sequence of words: P(w 1,..., w T ) Useful for machine translation Word ordering: p (the cat is small) > p (small the
More informationNeural Word Embeddings from Scratch
Neural Word Embeddings from Scratch Xin Li 12 1 NLP Center Tencent AI Lab 2 Dept. of System Engineering & Engineering Management The Chinese University of Hong Kong 2018-04-09 Xin Li Neural Word Embeddings
More informationDISTRIBUTIONAL SEMANTICS
COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS LEXICAL DATABASES - PROBLEMS Manually constructed Expensive Human annotation can be biased and noisy Language is dynamic New words: slangs, terminology, etc.
More informationSemantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing
Semantics with Dense Vectors Reference: D. Jurafsky and J. Martin, Speech and Language Processing 1 Semantics with Dense Vectors We saw how to represent a word as a sparse vector with dimensions corresponding
More informationLecture 6: Neural Networks for Representing Word Meaning
Lecture 6: Neural Networks for Representing Word Meaning Mirella Lapata School of Informatics University of Edinburgh mlap@inf.ed.ac.uk February 7, 2017 1 / 28 Logistic Regression Input is a feature vector,
More informationSparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent
More informationANLP Lecture 22 Lexical Semantics with Dense Vectors
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous
More informationNatural Language Processing
Natural Language Processing Word vectors Many slides borrowed from Richard Socher and Chris Manning Lecture plan Word representations Word vectors (embeddings) skip-gram algorithm Relation to matrix factorization
More informationPart-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287
Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationDeep Learning for NLP Part 2
Deep Learning for NLP Part 2 CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio) 2 Part 1.3: The Basics Word Representations The
More informationEmbeddings Learned By Matrix Factorization
Embeddings Learned By Matrix Factorization Benjamin Roth; Folien von Hinrich Schütze Center for Information and Language Processing, LMU Munich Overview WordSpace limitations LinAlgebra review Input matrix
More informationWord Embeddings 2 - Class Discussions
Word Embeddings 2 - Class Discussions Jalaj February 18, 2016 Opening Remarks - Word embeddings as a concept are intriguing. The approaches are mostly adhoc but show good empirical performance. Paper 1
More informationNatural Language Processing with Deep Learning CS224N/Ling284. Richard Socher Lecture 2: Word Vectors
Natural Language Processing with Deep Learning CS224N/Ling284 Richard Socher Lecture 2: Word Vectors Organization PSet 1 is released. Coding Session 1/22: (Monday, PA1 due Thursday) Some of the questions
More informationword2vec Parameter Learning Explained
word2vec Parameter Learning Explained Xin Rong ronxin@umich.edu Abstract The word2vec model and application by Mikolov et al. have attracted a great amount of attention in recent two years. The vector
More informationBayesian Paragraph Vectors
Bayesian Paragraph Vectors Geng Ji 1, Robert Bamler 2, Erik B. Sudderth 1, and Stephan Mandt 2 1 Department of Computer Science, UC Irvine, {gji1, sudderth}@uci.edu 2 Disney Research, firstname.lastname@disneyresearch.com
More informationCME323 Distributed Algorithms and Optimization. GloVe on Spark. Alex Adamson SUNet ID: aadamson. June 6, 2016
GloVe on Spark Alex Adamson SUNet ID: aadamson June 6, 2016 Introduction Pennington et al. proposes a novel word representation algorithm called GloVe (Global Vectors for Word Representation) that synthesizes
More informationWord2Vec Embedding. Embedding. Word Embedding 1.1 BEDORE. Word Embedding. 1.2 Embedding. Word Embedding. Embedding.
c Word Embedding Embedding Word2Vec Embedding Word EmbeddingWord2Vec 1. Embedding 1.1 BEDORE 0 1 BEDORE 113 0033 2 35 10 4F y katayama@bedore.jp Word Embedding Embedding 1.2 Embedding Embedding Word Embedding
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 9: Dimension Reduction/Word2vec Cho-Jui Hsieh UC Davis May 15, 2018 Principal Component Analysis Principal Component Analysis (PCA) Data
More informationN-gram Language Modeling Tutorial
N-gram Language Modeling Tutorial Dustin Hillard and Sarah Petersen Lecture notes courtesy of Prof. Mari Ostendorf Outline: Statistical Language Model (LM) Basics n-gram models Class LMs Cache LMs Mixtures
More informationA Unified Learning Framework of Skip-Grams and Global Vectors
A Unified Learning Framework of Skip-Grams and Global Vectors Jun Suzuki and Masaaki Nagata NTT Communication Science Laboratories, NTT Corporation 2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237
More informationN-gram Language Modeling
N-gram Language Modeling Outline: Statistical Language Model (LM) Intro General N-gram models Basic (non-parametric) n-grams Class LMs Mixtures Part I: Statistical Language Model (LM) Intro What is a statistical
More informationAlgorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley
Algorithms for NLP Language Modeling III Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Office hours on website but no OH for Taylor until next week. Efficient Hashing Closed address
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 6: Numerical Linear Algebra: Applications in Machine Learning Cho-Jui Hsieh UC Davis April 27, 2017 Principal Component Analysis Principal
More informationDeep Learning for Natural Language Processing. Sidharth Mudgal April 4, 2017
Deep Learning for Natural Language Processing Sidharth Mudgal April 4, 2017 Table of contents 1. Intro 2. Word Vectors 3. Word2Vec 4. Char Level Word Embeddings 5. Application: Entity Matching 6. Conclusion
More informationHomework 3 COMS 4705 Fall 2017 Prof. Kathleen McKeown
Homework 3 COMS 4705 Fall 017 Prof. Kathleen McKeown The assignment consists of a programming part and a written part. For the programming part, make sure you have set up the development environment as
More informationAn overview of word2vec
An overview of word2vec Benjamin Wilson Berlin ML Meetup, July 8 2014 Benjamin Wilson word2vec Berlin ML Meetup 1 / 25 Outline 1 Introduction 2 Background & Significance 3 Architecture 4 CBOW word representations
More informationMatrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations
Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations Alexandre Salle 1 Marco Idiart 2 Aline Villavicencio 1 1 Institute of Informatics 2 Physics Department
More informationThe Noisy Channel Model and Markov Models
1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle
More informationDeep Learning for NLP
Deep Learning for NLP Instructor: Wei Xu Ohio State University CSE 5525 Many slides from Greg Durrett Outline Motivation for neural networks Feedforward neural networks Applying feedforward neural networks
More informationN-grams. Motivation. Simple n-grams. Smoothing. Backoff. N-grams L545. Dept. of Linguistics, Indiana University Spring / 24
L545 Dept. of Linguistics, Indiana University Spring 2013 1 / 24 Morphosyntax We just finished talking about morphology (cf. words) And pretty soon we re going to discuss syntax (cf. sentences) In between,
More informationarxiv: v3 [cs.cl] 30 Jan 2016
word2vec Parameter Learning Explained Xin Rong ronxin@umich.edu arxiv:1411.2738v3 [cs.cl] 30 Jan 2016 Abstract The word2vec model and application by Mikolov et al. have attracted a great amount of attention
More informationInstructions for NLP Practical (Units of Assessment) SVM-based Sentiment Detection of Reviews (Part 2)
Instructions for NLP Practical (Units of Assessment) SVM-based Sentiment Detection of Reviews (Part 2) Simone Teufel (Lead demonstrator Guy Aglionby) sht25@cl.cam.ac.uk; ga384@cl.cam.ac.uk This is the
More informationtext classification 3: neural networks
text classification 3: neural networks CS 585, Fall 2018 Introduction to Natural Language Processing http://people.cs.umass.edu/~miyyer/cs585/ Mohit Iyyer College of Information and Computer Sciences University
More informationCS224n: Natural Language Processing with Deep Learning 1
CS224n: Natural Language Processing with Deep Learning Lecture Notes: Part I 2 Winter 27 Course Instructors: Christopher Manning, Richard Socher 2 Authors: Francois Chaubard, Michael Fang, Guillaume Genthial,
More informationNeural Networks for NLP. COMP-599 Nov 30, 2016
Neural Networks for NLP COMP-599 Nov 30, 2016 Outline Neural networks and deep learning: introduction Feedforward neural networks word2vec Complex neural network architectures Convolutional neural networks
More informationMachine Learning for natural language processing
Machine Learning for natural language processing Classification: Maximum Entropy Models Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 24 Introduction Classification = supervised
More informationDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing Dylan Drover, Borui Ye, Jie Peng University of Waterloo djdrover@uwaterloo.ca borui.ye@uwaterloo.ca July 8, 2015 Dylan Drover, Borui Ye, Jie Peng (University
More informationSemantics with Dense Vectors
Speech and Language Processing Daniel Jurafsky & James H Martin Copyright c 2016 All rights reserved Draft of August 7, 2017 CHAPTER 16 Semantics with Dense Vectors In the previous chapter we saw how to
More informationSeman&cs with Dense Vectors. Dorota Glowacka
Semancs with Dense Vectors Dorota Glowacka dorota.glowacka@ed.ac.uk Previous lectures: - how to represent a word as a sparse vector with dimensions corresponding to the words in the vocabulary - the values
More informationLanguage Modeling. Michael Collins, Columbia University
Language Modeling Michael Collins, Columbia University Overview The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques: Linear interpolation Discounting
More informationNatural Language Processing
Natural Language Processing Info 59/259 Lecture 4: Text classification 3 (Sept 5, 207) David Bamman, UC Berkeley . https://www.forbes.com/sites/kevinmurnane/206/04/0/what-is-deep-learning-and-how-is-it-useful
More informationNatural Language Processing (CSE 517): Cotext Models (II)
Natural Language Processing (CSE 517): Cotext Models (II) Noah Smith c 2016 University of Washington nasmith@cs.washington.edu January 25, 2016 Thanks to David Mimno for comments. 1 / 30 Three Kinds of
More informationLecture 7: Word Embeddings
Lecture 7: Word Embeddings Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 6501 Natural Language Processing 1 This lecture v Learning word vectors
More informationNatural Language Processing and Recurrent Neural Networks
Natural Language Processing and Recurrent Neural Networks Pranay Tarafdar October 19 th, 2018 Outline Introduction to NLP Word2vec RNN GRU LSTM Demo What is NLP? Natural Language? : Huge amount of information
More informationPart of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015
Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about
More informationThe representation of word and sentence
2vec Jul 4, 2017 Presentation Outline 2vec 1 2 2vec 3 4 5 6 discrete representation taxonomy:wordnet Example:good 2vec Problems 2vec synonyms: adept,expert,good It can t keep up to date It can t accurate
More informationMulti-View Learning of Word Embeddings via CCA
Multi-View Learning of Word Embeddings via CCA Paramveer S. Dhillon Dean Foster Lyle Ungar Computer & Information Science Statistics Computer & Information Science University of Pennsylvania, Philadelphia,
More informationPrepositional Phrase Attachment over Word Embedding Products
Prepositional Phrase Attachment over Word Embedding Products Pranava Madhyastha (1), Xavier Carreras (2), Ariadna Quattoni (2) (1) University of Sheffield (2) Naver Labs Europe Prepositional Phrases I
More informationChapter 3: Basics of Language Modelling
Chapter 3: Basics of Language Modelling Motivation Language Models are used in Speech Recognition Machine Translation Natural Language Generation Query completion For research and development: need a simple
More informationDeep Learning. Language Models and Word Embeddings. Christof Monz
Deep Learning Today s Class N-gram language modeling Feed-forward neural language model Architecture Final layer computations Word embeddings Continuous bag-of-words model Skip-gram Negative sampling 1
More informationStatistical Methods for NLP
Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured
More informationA fast and simple algorithm for training neural probabilistic language models
A fast and simple algorithm for training neural probabilistic language models Andriy Mnih Joint work with Yee Whye Teh Gatsby Computational Neuroscience Unit University College London 25 January 2013 1
More informationStatistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields
Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project
More informationModeling Topics and Knowledge Bases with Embeddings
Modeling Topics and Knowledge Bases with Embeddings Dat Quoc Nguyen and Mark Johnson Department of Computing Macquarie University Sydney, Australia December 2016 1 / 15 Vector representations/embeddings
More informationMidterm sample questions
Midterm sample questions CS 585, Brendan O Connor and David Belanger October 12, 2014 1 Topics on the midterm Language concepts Translation issues: word order, multiword translations Human evaluation Parts
More informationLearning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations
Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng CAS Key Lab of Network Data Science and Technology Institute
More informationDeep Learning for NLP
Deep Learning for NLP CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio) Machine Learning and NLP NER WordNet Usually machine learning
More information11/3/15. Deep Learning for NLP. Deep Learning and its Architectures. What is Deep Learning? Advantages of Deep Learning (Part 1)
11/3/15 Machine Learning and NLP Deep Learning for NLP Usually machine learning works well because of human-designed representations and input features CS224N WordNet SRL Parser Machine learning becomes
More informationLearning Features from Co-occurrences: A Theoretical Analysis
Learning Features from Co-occurrences: A Theoretical Analysis Yanpeng Li IBM T. J. Watson Research Center Yorktown Heights, New York 10598 liyanpeng.lyp@gmail.com Abstract Representing a word by its co-occurrences
More informationNatural Language Processing
David Packard, A Concordance to Livy (1968) Natural Language Processing Info 159/259 Lecture 8: Vector semantics (Sept 19, 2017) David Bamman, UC Berkeley Announcements Homework 2 party today 5-7pm: 202
More informationNatural Language Processing
David Packard, A Concordance to Livy (1968) Natural Language Processing Info 159/259 Lecture 8: Vector semantics and word embeddings (Sept 18, 2018) David Bamman, UC Berkeley 259 project proposal due 9/25
More informationDeep Learning For Mathematical Functions
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationFoundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model
Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipop Koehn) 30 January
More informationStatistical Methods for NLP
Statistical Methods for NLP Language Models, Graphical Models Sameer Maskey Week 13, April 13, 2010 Some slides provided by Stanley Chen and from Bishop Book Resources 1 Announcements Final Project Due,
More informationLanguage Modelling: Smoothing and Model Complexity. COMP-599 Sept 14, 2016
Language Modelling: Smoothing and Model Complexity COMP-599 Sept 14, 2016 Announcements A1 has been released Due on Wednesday, September 28th Start code for Question 4: Includes some of the package import
More informationContinuous Space Language Model(NNLM) Liu Rong Intern students of CSLT
Continuous Space Language Model(NNLM) Liu Rong Intern students of CSLT 2013-12-30 Outline N-gram Introduction data sparsity and smooth NNLM Introduction Multi NNLMs Toolkit Word2vec(Deep learing in NLP)
More informationACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging
ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers
More informationNatural Language Processing (CSE 517): Cotext Models
Natural Language Processing (CSE 517): Cotext Models Noah Smith c 2018 University of Washington nasmith@cs.washington.edu April 18, 2018 1 / 32 Latent Dirichlet Allocation (Blei et al., 2003) Widely used
More informationNaïve Bayes, Maxent and Neural Models
Naïve Bayes, Maxent and Neural Models CMSC 473/673 UMBC Some slides adapted from 3SLP Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words
More informationCS224n: Natural Language Processing with Deep Learning 1
CS224n: Natural Language Processing with Deep Learning 1 Lecture Notes: Part I 2 Winter 2017 1 Course Instructors: Christopher Manning, Richard Socher 2 Authors: Francois Chaubard, Michael Fang, Guillaume
More informationMore Smoothing, Tuning, and Evaluation
More Smoothing, Tuning, and Evaluation Nathan Schneider (slides adapted from Henry Thompson, Alex Lascarides, Chris Dyer, Noah Smith, et al.) ENLP 21 September 2016 1 Review: 2 Naïve Bayes Classifier w
More informationNotes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces
Notes on the framework of Ando and Zhang (2005 Karl Stratos 1 Beyond learning good functions: learning good spaces 1.1 A single binary classification problem Let X denote the problem domain. Suppose we
More informationDeep Learning: A Statistical Perspective
Introduction Deep Learning: A Statistical Perspective Myunghee Cho Paik Guest lectures by Gisoo Kim, Yongchan Kwon, Young-geun Kim, Wonyoung Kim and Youngwon Choi Seoul National University March-June,
More informationRegularization Introduction to Machine Learning. Matt Gormley Lecture 10 Feb. 19, 2018
1-61 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Regularization Matt Gormley Lecture 1 Feb. 19, 218 1 Reminders Homework 4: Logistic
More informationNatural Language Processing. Statistical Inference: n-grams
Natural Language Processing Statistical Inference: n-grams Updated 3/2009 Statistical Inference Statistical Inference consists of taking some data (generated in accordance with some unknown probability
More informationLearning to translate with neural networks. Michael Auli
Learning to translate with neural networks Michael Auli 1 Neural networks for text processing Similar words near each other France Spain dog cat Neural networks for text processing Similar words near each
More informationEmpirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model
Empirical Methods in Natural Language Processing Lecture 10a More smoothing and the Noisy Channel Model (most slides from Sharon Goldwater; some adapted from Philipp Koehn) 5 October 2016 Nathan Schneider
More informationarxiv: v2 [cs.cl] 1 Jan 2019
Variational Self-attention Model for Sentence Representation arxiv:1812.11559v2 [cs.cl] 1 Jan 2019 Qiang Zhang 1, Shangsong Liang 2, Emine Yilmaz 1 1 University College London, London, United Kingdom 2
More informationGenerative Clustering, Topic Modeling, & Bayesian Inference
Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week
More informationANLP Lecture 6 N-gram models and smoothing
ANLP Lecture 6 N-gram models and smoothing Sharon Goldwater (some slides from Philipp Koehn) 27 September 2018 Sharon Goldwater ANLP Lecture 6 27 September 2018 Recap: N-gram models We can model sentence
More informationRecap: Language models. Foundations of Natural Language Processing Lecture 4 Language Models: Evaluation and Smoothing. Two types of evaluation in NLP
Recap: Language models Foundations of atural Language Processing Lecture 4 Language Models: Evaluation and Smoothing Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipp
More informationLanguage Models. CS6200: Information Retrieval. Slides by: Jesse Anderton
Language Models CS6200: Information Retrieval Slides by: Jesse Anderton What s wrong with VSMs? Vector Space Models work reasonably well, but have a few problems: They are based on bag-of-words, so they
More informationPAC Generalization Bounds for Co-training
PAC Generalization Bounds for Co-training Sanjoy Dasgupta AT&T Labs Research dasgupta@research.att.com Michael L. Littman AT&T Labs Research mlittman@research.att.com David McAllester AT&T Labs Research
More informationNatural Language Processing SoSe Words and Language Model
Natural Language Processing SoSe 2016 Words and Language Model Dr. Mariana Neves May 2nd, 2016 Outline 2 Words Language Model Outline 3 Words Language Model Tokenization Separation of words in a sentence
More informationProbabilistic Context Free Grammars. Many slides from Michael Collins
Probabilistic Context Free Grammars Many slides from Michael Collins Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar
More informationCross-Lingual Language Modeling for Automatic Speech Recogntion
GBO Presentation Cross-Lingual Language Modeling for Automatic Speech Recogntion November 14, 2003 Woosung Kim woosung@cs.jhu.edu Center for Language and Speech Processing Dept. of Computer Science The
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationFactorization of Latent Variables in Distributional Semantic Models
Factorization of Latent Variables in Distributional Semantic Models Arvid Österlund and David Ödling KTH Royal Institute of Technology, Sweden arvidos dodling@kth.se Magnus Sahlgren Gavagai, Sweden mange@gavagai.se
More informationGenerative Models for Sentences
Generative Models for Sentences Amjad Almahairi PhD student August 16 th 2014 Outline 1. Motivation Language modelling Full Sentence Embeddings 2. Approach Bayesian Networks Variational Autoencoders (VAE)
More informationLog-linear models (part 1)
Log-linear models (part 1) CS 690N, Spring 2018 Advanced Natural Language Processing http://people.cs.umass.edu/~brenocon/anlp2018/ Brendan O Connor College of Information and Computer Sciences University
More informationperplexity = 2 cross-entropy (5.1) cross-entropy = 1 N log 2 likelihood (5.2) likelihood = P(w 1 w N ) (5.3)
Chapter 5 Language Modeling 5.1 Introduction A language model is simply a model of what strings (of words) are more or less likely to be generated by a speaker of English (or some other language). More
More informationSYNTHER A NEW M-GRAM POS TAGGER
SYNTHER A NEW M-GRAM POS TAGGER David Sündermann and Hermann Ney RWTH Aachen University of Technology, Computer Science Department Ahornstr. 55, 52056 Aachen, Germany {suendermann,ney}@cs.rwth-aachen.de
More informationApplied Natural Language Processing
Applied Natural Language Processing Info 256 Lecture 9: Lexical semantics (Feb 19, 2019) David Bamman, UC Berkeley Lexical semantics You shall know a word by the company it keeps [Firth 1957] Harris 1954
More informationNatural Language Processing (CSE 490U): Language Models
Natural Language Processing (CSE 490U): Language Models Noah Smith c 2017 University of Washington nasmith@cs.washington.edu January 6 9, 2017 1 / 67 Very Quick Review of Probability Event space (e.g.,
More informationLecture 5 Neural models for NLP
CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm
More informationMachine Learning for natural language processing
Machine Learning for natural language processing Classification: Naive Bayes Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 20 Introduction Classification = supervised method for
More information