Distributional Semantics and Word Embeddings. Chase Geigle


1 Distributional Semantics and Word Embeddings Chase Geigle

2 What is a word? dog 2

3 What is a word? dog canine 2

4 What is a word? dog canine 3 2

5 What is a word? dog 3 canine 399,999 2

6 What is a word? dog canine 399,999 2

7 What is a word? dog canine 399,999 2

8 What is a word? dog canine 399,999 Sparsity! 2

9 The Scourge of Sparsity In the real world, sparsity hurts us: low vocabulary coverage in training → high OOV rate in applications (poor generalization); one-hot representations → huge parameter counts; information about one word is completely unutilized for another! Thus, a good representation must reduce parameter space, improve generalization, and somehow transfer or share knowledge between words. 3

10 The Distributional Hypothesis You shall know a word by the company it keeps. (J. R. Firth, 1957) Words with high similarity occur in the same contexts as one another. 4

11 The Distributional Hypothesis You shall know a word by the company it keeps. (J. R. Firth, 1957) Words with high similarity occur in the same contexts as one another. A bit of foreshadowing: A word ought to be able to predict its context (word2vec Skip-Gram) A context ought to be able to predict its missing word (word2vec CBOW) 4

12 Brown Clustering

13 Language Models A language model is a 5

14 Language Models A language model is a probability distribution over 5

15 Language Models A language model is a probability distribution over word sequences. 5

16 Language Models A language model is a probability distribution over word sequences. Unigram language model p(w | θ): p(w | θ) = ∏_{i=1}^{N} p(w_i | θ) 5

17 Language Models A language model is a probability distribution over word sequences. Unigram language model p(w | θ): p(w | θ) = ∏_{i=1}^{N} p(w_i | θ) Bigram language model p(w_i | w_{i-1}, θ): p(w | θ) = ∏_{i=1}^{N} p(w_i | w_{i-1}, θ) 5

18 Language Models A language model is a probability distribution over word sequences. Unigram language model p(w | θ): p(w | θ) = ∏_{i=1}^{N} p(w_i | θ) Bigram language model p(w_i | w_{i-1}, θ): p(w | θ) = ∏_{i=1}^{N} p(w_i | w_{i-1}, θ) n-gram language model: p(w | θ) = ∏_{i=1}^{N} p(w_i | w_{i-n+1}^{i-1}, θ) 5
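
To make these definitions concrete, here is a minimal sketch (not from the slides; the function names and toy corpus are illustrative) of maximum-likelihood unigram and bigram estimates:

```python
from collections import Counter

def unigram_mle(tokens):
    """MLE unigram model: p(w | theta) = count(w) / N."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def bigram_mle(tokens):
    """MLE bigram model: p(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    prev_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens[:-1], tokens[1:]))
    return {(prev, w): c / prev_counts[prev] for (prev, w), c in pair_counts.items()}

corpus = "the dog barks the dog runs the canine barks".split()
print(unigram_mle(corpus)["dog"])          # 2/9
print(bigram_mle(corpus)[("the", "dog")])  # 2/3
```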

19 A Class-Based Language Model 1 Main Idea: cluster words into a fixed number of clusters C and use their cluster assignments as their identity instead (reducing sparsity). If π : V → C is a mapping function from a word type to a cluster (or class), we want to find π* = arg max_π p(w | π), where p(w | π) = ∏_{i=1}^{N} p(c_i | c_{i-1}) p(w_i | c_i). 1 Peter F. Brown et al. Class-based N-gram Models of Natural Language. In: Comput. Linguist. (Dec. 1992). 6

20 Finding the best partition π* = arg max_π P(W | π) = arg max_π log P(W | π). One can derive 2 L(π) to be L(π) = Σ_w n_w log n_w + Σ_{c_i,c_j} n_{c_i,c_j} log ( n_{c_i,c_j} / (n_{c_i} n_{c_j}) ), where the first term is (nearly) unigram entropy (fixed w.r.t. π) and the second is (nearly) mutual information (varies with π). 2 Sven Martin, Jörg Liermann, and Hermann Ney. Algorithms for Bigram and Trigram Word Clustering. In: Speech Commun. (Apr. 1998). 7

21 Finding the best partition L(π) = Σ_w n_w log n_w + Σ_{c_i,c_j} n_{c_i,c_j} log ( n_{c_i,c_j} / (n_{c_i} n_{c_j}) ), where the first term is (nearly) unigram entropy (fixed w.r.t. π) and the second is (nearly) mutual information (varies with π). What does maximizing this mean? Recall MI(c, c′) = Σ_{c_i,c_j} p(c_i, c_j) log [ p(c_i, c_j) / (p(c_i) p(c_j)) ], which can be shown to be MI(c, c′) = (1/N) Σ_{c_i,c_j} n_{c_i,c_j} log [ n_{c_i,c_j} / (n_{c_i} n_{c_j}) ] + log N, where 1/N and log N are constants, and thus maximizing the MI of adjacent classes selects the best π. 8

22 Finding the best partition (cont'd) π* = arg max_π [ Σ_w n_w log n_w + Σ_{c_i,c_j} n_{c_i,c_j} log ( n_{c_i,c_j} / (n_{c_i} n_{c_j}) ) ], with the first term (nearly) unigram entropy (fixed w.r.t. π) and the second (nearly) mutual information (varies with π). Direct maximization is intractable! Thus, agglomerative (bottom-up) clustering is used as a greedy heuristic. The best merge is determined by the lowest loss in average mutual information. 9
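
A rough sketch of this greedy heuristic (assumed, not the Brown et al. implementation): start with one cluster per word type, score each candidate merge by the resulting mutual-information term, and keep the merge that loses the least. This naive version recomputes the objective for every candidate merge and approximates the class counts with left/right bigram marginals; the names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log

def mi_term(assign, bigrams):
    """Sum over class pairs of n(c,c') * log(n(c,c') / (n_left(c) * n_right(c'))),
    using left/right bigram marginals as a stand-in for the class counts."""
    pair_counts = Counter((assign[a], assign[b]) for a, b in bigrams)
    left, right = Counter(), Counter()
    for (c1, c2), n in pair_counts.items():
        left[c1] += n
        right[c2] += n
    return sum(n * log(n / (left[c1] * right[c2])) for (c1, c2), n in pair_counts.items())

def greedy_cluster(tokens, num_clusters):
    """Bottom-up clustering: repeatedly apply the merge whose resulting objective
    is highest, i.e. the merge that loses the least mutual information."""
    bigrams = list(zip(tokens[:-1], tokens[1:]))
    assign = {w: w for w in set(tokens)}          # start with one cluster per word type
    while len(set(assign.values())) > num_clusters:
        best = None
        for a, b in combinations(sorted(set(assign.values())), 2):
            trial = {w: (a if c == b else c) for w, c in assign.items()}
            score = mi_term(trial, bigrams)
            if best is None or score > best[0]:
                best = (score, trial)
        assign = best[1]
    return assign

tokens = "the dog barks the cat meows a dog runs a cat sleeps".split()
print(greedy_cluster(tokens, 4))
```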

23 Agglomerative Clustering 10

24 Do they work? (Yes.) Named entity recognition: Scott Miller, Jethran Guinness, and Alex Zamanian. Name Tagging with Word Clusters and Discriminative Training. In: HLT-NAACL. 2004. Lev Ratinov and Dan Roth. Design Challenges and Misconceptions in Named Entity Recognition. In: Proc. CoNLL. 2009. Dependency parsing: Terry Koo, Xavier Carreras, and Michael Collins. Simple Semi-supervised Dependency Parsing. In: Proc. ACL: HLT. June 2008. Jun Suzuki and Hideki Isozaki. Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data. In: Proc. ACL: HLT. June 2008. Constituency parsing: Marie Candito and Benoît Crabbé. Improving Generative Statistical Parsing with Semi-supervised Word Clustering. In: IWPT. 2009. Muhua Zhu et al. Fast and Accurate Shift-Reduce Constituent Parsing. In: ACL. Aug. 2013. 11

25 Vector Spaces for Word Representation

26 Toward Vector Spaces Brown clusters are nice, but limiting: Arbitrary cut-off point: two words may be assigned different classes arbitrarily because of our chosen cutoff. Definition of similarity is limited to only bigrams of classes. 3 Completely misses lots of regularity that's present in language. 3 There are adaptations of Brown clustering that move past this, but what we'll see today is still even better. 12

27 What do I mean by regularity? woman is to sister as man is to 13

28 What do I mean by regularity? woman is to sister as man is to brother 13

29 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to 13

30 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow 13

31 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to 13

32 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen 13

33 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen fell is to fallen as ate is to 13

34 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen fell is to fallen as ate is to eaten 13

35 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen fell is to fallen as ate is to eaten running is to ran as crying is to 13

36 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen fell is to fallen as ate is to eaten running is to ran as crying is to cried 13

37 What do I mean by regularity? woman is to sister as man is to brother summer is to rain as winter is to snow man is to king as woman is to queen fell is to fallen as ate is to eaten running is to ran as crying is to cried The differences between each pair of words are similar. Can our word representations capture this? (demo) 13


40 Neural Word Embeddings

41 Associate a low-dimensional, dense vector w with each word w ∈ V so that similar words (in a distributional sense) share a similar vector representation. 15

42 If I only had a vector: The Analogy Problem To solve analogy problems of the form "w_a is to w_b as w_c is to what?", we can simply compute a query vector q = w_b − w_a + w_c and find the most similar word vector v ∈ W to q. If we normalize q to unit length, q̂ = q / ||q||, and assume each vector in W is also unit length, this reduces to computing arg max_{v ∈ W} q̂ · v and returning the associated word v. 16
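
A small numpy sketch of this query (illustrative only; the embedding matrix and vocabulary below are random stand-ins for trained vectors):

```python
import numpy as np

def analogy(W, vocab, a, b, c):
    """Return the word whose unit-length vector is most similar to w_b - w_a + w_c."""
    idx = {w: i for i, w in enumerate(vocab)}
    W_unit = W / np.linalg.norm(W, axis=1, keepdims=True)   # normalize rows
    q = W_unit[idx[b]] - W_unit[idx[a]] + W_unit[idx[c]]
    q /= np.linalg.norm(q)
    scores = W_unit @ q                                     # cosine similarities
    for w in (a, b, c):                                     # exclude the query words
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

rng = np.random.default_rng(0)
vocab = ["man", "woman", "king", "queen", "apple"]
W = rng.normal(size=(len(vocab), 50))      # random stand-in for trained embeddings
print(analogy(W, vocab, "man", "king", "woman"))
```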

43 The CBOW and Skip-Gram Models 4 (word2vec) [Figure: CBOW predicts the center word w(t) from the summed projection of its context words w(t−2), w(t−1), w(t+1), w(t+2); Skip-gram predicts each context word from the center word w(t).] 4 Tomas Mikolov et al. Efficient estimation of word representations in vector space. In: ICLR Workshop (2013). 17

44 Skip-Gram, Mathematically Training corpus w_1, w_2, ..., w_N (with N typically in the billions) from a fixed vocabulary V. Goal is to maximize the average log probability: (1/N) Σ_{i=1}^{N} Σ_{−L ≤ k ≤ L, k ≠ 0} log p(w_{i+k} | w_i). Associate with each w ∈ V an input vector w ∈ R^d and an output vector w̃ ∈ R^d. Model context probabilities as p(c | w) = exp(w · c̃) / Σ_{c' ∈ V} exp(w · c̃'). The problem? V is huge! log p(c | w) takes time O(|V|) to compute! 18
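
A sketch of why the full softmax is expensive (the matrices W_in and W_out below are random stand-ins for the input and output vectors):

```python
import numpy as np

def softmax_context_prob(W_in, W_out, w_idx, c_idx):
    """p(c | w) under the full softmax: one dot product per vocabulary entry,
    so a single log-probability costs O(|V| * d)."""
    scores = W_out @ W_in[w_idx]          # |V| dot products
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[c_idx]

rng = np.random.default_rng(1)
V, d = 10_000, 100                        # tiny compared to a real vocabulary
W_in = rng.normal(scale=0.1, size=(V, d))    # input ("word") vectors
W_out = rng.normal(scale=0.1, size=(V, d))   # output ("context") vectors
print(softmax_context_prob(W_in, W_out, w_idx=42, c_idx=7))
```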

45 Negative Sampling 5 Given a pair (w, c), can we determine if this came from our corpus or not? Model probabilistically as p(d = 1 | w, c) = σ(w · c̃) = 1 / (1 + exp(−w · c̃)). Goal: maximize p(d = 1 | w, c) for pairs (w, c) that occur in the data. Also maximize p(d = 0 | w, c_N) for (w, c_N) pairs where c_N is sampled randomly from the empirical unigram distribution. 5 Tomas Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality. In: Advances in Neural Information Processing Systems (2013). 19

46 Skip-Gram Negative Sampling Objective Locally, ℓ = log σ(w · c̃) + k · E_{c_N ∼ P_n(c_N)} [log σ(−w · c̃_N)], and thus globally L = Σ_{w ∈ V} Σ_{c ∈ V} n_{w,c} ( log σ(w · c̃) + k · E_{c_N ∼ P_n(c_N)} [log σ(−w · c̃_N)] ). k: number of negative samples to take (hyperparameter). n_{w,c}: number of times (w, c) was seen in the data. E_{c_N ∼ P_n(c_N)} indicates an expectation taken with respect to the noise distribution P_n(c_N). 20
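
A minimal sketch of the local loss for one observed pair plus k sampled negatives (vector values are random stand-ins; the expectation is approximated by the drawn samples):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_local_loss(w_vec, c_vec, neg_vecs):
    """Negative of the local SGNS objective for one observed (w, c) pair:
    -log sigma(w.c) - sum_n log sigma(-w.c_n)."""
    pos = np.log(sigmoid(w_vec @ c_vec))
    neg = sum(np.log(sigmoid(-w_vec @ c_n)) for c_n in neg_vecs)
    return -(pos + neg)

rng = np.random.default_rng(2)
d, k = 100, 5
w = rng.normal(scale=0.1, size=d)            # input vector of the target word
c = rng.normal(scale=0.1, size=d)            # output vector of the observed context
negs = rng.normal(scale=0.1, size=(k, d))    # output vectors of k noise contexts
print(sgns_local_loss(w, c, negs))
```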

47 Implementation Details (Read: Useful Heuristics) 1. What is the noise distribution? P_n(c_N) ∝ (n_{c_N}/N)^{3/4} (the empirical unigram distribution raised to the 3/4 power, then renormalized). 2. Frequent words can dominate the loss. Throw away word w_i in the training data with probability P(w_i) = 1 − sqrt( t / (n_{w_i}/Z) ), where t is some threshold (like 10^{-5}) and n_{w_i}/Z is the empirical frequency of w_i. Both are essentially unexplained by Mikolov et al. 21
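
A sketch of both heuristics, assuming raw word counts in a Counter; the 1 − sqrt(t/f) discard form follows Mikolov et al.'s paper, and the names are illustrative:

```python
import numpy as np
from collections import Counter

def noise_distribution(counts, power=0.75):
    """Empirical unigram distribution raised to the 3/4 power, renormalized."""
    words = list(counts)
    probs = np.array([counts[w] for w in words], dtype=float) ** power
    return words, probs / probs.sum()

def discard_prob(freq, t=1e-5):
    """Probability of dropping a token whose word has empirical frequency `freq`;
    words with frequency below t are never dropped."""
    return max(0.0, 1.0 - np.sqrt(t / freq))

counts = Counter({"the": 1_000_000, "dog": 5_000, "canine": 5})
total = sum(counts.values())
words, p_noise = noise_distribution(counts)
print(dict(zip(words, p_noise.round(4))))
print(discard_prob(counts["the"] / total))     # very frequent word: almost always dropped
print(discard_prob(counts["canine"] / total))  # rare word: never dropped (0.0)
```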

48 What does this actually do? It turns out this is nothing that new! Levy and Goldberg 6 show that SGNS is implicitly factorizing the matrix M_{i,j} = w_i · c̃_j = PMI(w_i, c_j) − log k = log [ p(w_i, c_j) / (p(w_i) p(c_j)) ] − log k, using an objective that weighs deviations in more frequent (w, c) pairs more strongly than less frequent ones. Thus... 6 Omer Levy and Yoav Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In: Advances in Neural Information Processing Systems (2014). 22

49 Matrix Factorization Methods for Word Embeddings

50 Why not just SVD? Since SGNS is just factorizing a shifted version of the PMI matrix, why not just perform a rank-k SVD on the PMI matrix directly? Assignment 6! With a little TLC, this method can actually work and does surprisingly well. 23
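
A minimal sketch of that route, assuming a dense word-by-context co-occurrence count matrix (the shift corresponds to the log k term from the previous slide; names and the random counts are illustrative):

```python
import numpy as np

def shifted_ppmi(counts, shift=0.0):
    """Positive (shifted) PMI from a word-by-context co-occurrence count matrix."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts contribute nothing
    return np.maximum(pmi - shift, 0.0)

def svd_embeddings(counts, dim=50, shift=0.0):
    """Rank-dim truncated SVD of the shifted PPMI matrix; rows are word embeddings."""
    M = shifted_ppmi(counts, shift=shift)
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])      # one common way to split the factors

rng = np.random.default_rng(3)
counts = rng.poisson(2.0, size=(200, 200)).astype(float)  # stand-in for real counts
W = svd_embeddings(counts, dim=50, shift=np.log(5))       # shift = log k with k = 5
print(W.shape)
```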

51 GloVe: Global Vectors for Word Representation 7 How can we capture words related to ice, but not to steam? [Table: P(w_k | ice), P(w_k | steam), and the ratio P(w_k | ice)/P(w_k | steam) for w_k ∈ {solid, gas, water, fashion}.] Probability ratios are most informative: solid is related to ice but not steam; gas is related to steam but not ice; water and fashion do not discriminate between ice or steam (ratios close to 1). 7 Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In: EMNLP. 2014. 24

52 Building a model: intuition We would like for the vectors w_i, w_j, and w̃_k to be able to capture the information present in the probability ratio. If we set w_i · w̃_k = log P(w_k | w_i), then we can easily see that (w_i − w_j) · w̃_k = log P(w_k | w_i) − log P(w_k | w_j) = log [ P(w_k | w_i) / P(w_k | w_j) ], ensuring that the information present in the co-occurrence probability ratio is expressed in the vector space. 25

53 GloVe Objective w_i · w̃_j + b_i + b̃_j = log X_ij. Again: two sets of vectors: target vectors w and context vectors w̃. X_ij = n_{w_i,w_j} is the number of times w_j appears in the context of w_i. Problem: all co-occurrences are weighted equally. J = Σ_{i,j=1}^{V} f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)^2. Final objective is just weighted least-squares. f(X_ij) serves as a dampener, lessening the weight of the rare co-occurrences: f(x) = (x / x_max)^α if x < x_max, and 1 otherwise, where α's default is (you guessed it) 3/4, and x_max's default is 100. 26
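
A sketch of the weighting function and the resulting weighted least-squares loss (random stand-in counts and vectors; not the reference GloVe implementation):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe's f(x): dampens rare co-occurrences, capped at 1 for frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted least-squares objective summed over nonzero co-occurrence counts."""
    i, j = np.nonzero(X)
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    err = pred - np.log(X[i, j])
    return np.sum(glove_weight(X[i, j]) * err ** 2)

rng = np.random.default_rng(4)
V, d = 300, 50
X = rng.poisson(1.0, size=(V, V)).astype(float)   # stand-in co-occurrence counts
W, W_ctx = rng.normal(scale=0.1, size=(2, V, d))  # target and context vectors
b, b_ctx = np.zeros(V), np.zeros(V)               # target and context biases
print(glove_loss(W, W_ctx, b, b_ctx, X))
```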

54 Relation to word2vec GloVe Objective: J = Σ_{i,j=1}^{V} f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)^2. word2vec (Skip-Gram) Objective (after rewriting): J = −Σ_{i=1}^{V} X_i Σ_{j=1}^{V} P(w_j | w_i) log Q(w_j | w_i), where X_i = Σ_k X_ik and P and Q are the empirical co-occurrence and model co-occurrence distributions, respectively. Weighted cross-entropy error! Authors show that replacing cross-entropy with least-squares nearly re-derives GloVe: Ĵ = Σ_{i,j} X_i (w_i · w̃_j − log X_ij)^2. 27

55 Model Complexity word2vec: O(|C|), linear in corpus size. GloVe naïve estimate: O(|V|^2), square of vocab size. But it actually only depends on the number of non-zero entries in X. If co-occurrences are modeled via a power law, then we have X_ij = k / (r_ij)^α. Modeling |C| under this assumption, the authors eventually arrive at |X| = O(|C|) if α < 1, and O(|C|^{1/α}) otherwise, where the corpora studied in the paper were well modeled with α = 1.25, leading to O(|C|^{0.8}). In practice, it's faster than word2vec. 28

56 Accuracy vs Vector Size [Figure: analogy accuracy (%) for semantic, syntactic, and overall questions as a function of vector dimension.] 29

57 Accuracy vs Window Size [Figure: analogy accuracy (%) for semantic, syntactic, and overall questions as a function of window size. Left: symmetric window; right: asymmetric window.] 30

58 Accuracy vs Corpus Choice [Figure: semantic, syntactic, and overall analogy accuracy (%) across training corpora: Wiki2010 (1B tokens), Wiki2014 (1.6B tokens), Gigaword5 (4.3B tokens), Gigaword5 + Wiki2014 (6B tokens), Common Crawl (42B tokens).] 31

59 Training Time [Figure: analogy accuracy (%) of GloVe vs. Skip-Gram as a function of training time (hrs), with secondary axes for iterations (GloVe) and negative samples (Skip-Gram).] 32

60 This evaluation is not quite fair! 32

61 But wait (Referring to word2vec) "... the code is designed only for a single training epoch ... specifies a learning schedule specific to one pass through the data, making a modification for multiple passes a non-trivial task." 33

62 But wait (Referring to word2vec) "... the code is designed only for a single training epoch ... specifies a learning schedule specific to one pass through the data, making a modification for multiple passes a non-trivial task." This is false, and a bad excuse. How could you modify word2vec to support multiple epochs with zero effort? 33

63 But wait "... we choose to use the sum W + W̃ as our word vectors." 34

64 But wait "... we choose to use the sum W + W̃ as our word vectors." word2vec also learns W̃! Why not do the same for both? 34

65 Embedding performance is very much a function of your hyperparameters! 34

66 Investigating Hyperparameters 8 Levy, Goldberg, and Dagan perform a systematic evaluation of word2vec, SVD, and GloVe where hyperparameters are controlled for and optimized across different methods for a variety of tasks. Interesting results: "MSR's analogy dataset is the only case where SGNS and GloVe substantially outperform PPMI and SVD." And: "SGNS outperforms GloVe in every task." 8 Omer Levy, Yoav Goldberg, and Ido Dagan. Improving Distributional Similarity with Lessons Learned from Word Embeddings. In: Transactions of the Association for Computational Linguistics 3 (2015). 35

67 But wait (mark II) Both GloVe and word2vec use context window weighting: word2vec samples a window size ℓ ∈ [1, L] for each token before extracting counts; GloVe weighs contexts by their distance d from the target word using the harmonic function 1/d. What is the impact of using the different context window strategies between methods? Levy, Goldberg, and Dagan did not investigate this, instead using word2vec's method across all of their evaluations. 36
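
To make the difference concrete, a tiny sketch (illustrative assumption: word2vec's dynamic window includes a word at distance d only when the sampled window size ℓ ≥ d) of the expected weight each scheme assigns to a context word at distance d:

```python
def word2vec_expected_weight(d, L):
    """Dynamic window: l ~ Uniform{1..L}, so a word at distance d is kept
    with probability (L - d + 1) / L."""
    return max(0, L - d + 1) / L

def glove_window_weight(d):
    """Harmonic weighting: a word at distance d counts with weight 1/d."""
    return 1.0 / d

L = 10
for d in range(1, L + 1):
    print(d, round(word2vec_expected_weight(d, L), 2), round(glove_window_weight(d), 2))
```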

68 Evaluating things properly is non-obvious and often very difficult! 36

69 What You Should Know

70 What You Should Know What is the sparsity problem in NLP? What is the distributional hypothesis? How does Brown clustering work? What is it trying to maximize? What is agglomerative clustering? What is a word embedding? What is the difference between the CBOW and Skip-Gram models? What is negative sampling and why is it useful? What is the learning objective for SGNS? What is the matrix that SGNS is implicitly factorizing? What is the learning objective for GloVe? What are some examples of hyperparameters for word embedding methods? 37

71 Implementations word2vec: the original implementation; Yoav Goldberg's word2vecf modification (multiple epochs, arbitrary context features); a Python implementation in gensim. GloVe: the original implementation; MeTA; the R implementation text2vec. 38
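
For example, a minimal gensim sketch (parameter names follow gensim 4.x, e.g. vector_size; check your installed version):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences; a real corpus has millions of tokens.
sentences = [
    "the dog barks at the cat".split(),
    "the canine chases the feline".split(),
    "dogs and cats are pets".split(),
]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension (called `size` before gensim 4.0)
    window=5,          # maximum context window L
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    negative=5,        # number of negative samples k
    min_count=1,       # keep every word in this tiny corpus
    epochs=10,
)

print(model.wv.most_similar("dog", topn=3))
```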

72 Thanks! 38
