Natural Language Processing
- Bernard Ferguson
Transcription
1 Natural Language Processing Word vectors Many slides borrowed from Richard Socher and Chris Manning
2 Lecture plan: word representations; word vectors (embeddings); skip-gram algorithm; relation to matrix factorization; evaluation.
3 Representing words
4 Representing words. Definition of meaning (Webster dictionary): the idea that is represented by a word, phrase, etc.; the idea that a person wants to express by using words, signs, etc.; the idea that is expressed in a work of writing, art, etc. In linguistics: signifier <==> signified (idea or thing) = denotation.
5 Representing words with computers. A word is the set of meanings it has in a taxonomy (a graph of meanings). Hypernym: the is-a relation. Hyponym: the opposite of hypernym.
6 Drawbacks. Expensive! Subjective (how to split different synsets?). Incomplete: wicked, badass, nifty, crack, ace, wizard, genius, ninja. Missing functionality: how do you compute word similarity? How do you compose meanings?
7 Discrete representation. Words are atomic symbols (one-hot representation): V = {hotel, motel, walk, wife, spouse}; hotel = [1 0 0 0 0], motel = [0 1 0 0 0], walk = [0 0 1 0 0], wife = [0 0 0 1 0], spouse = [0 0 0 0 1]. In practice |V| ≈ 100,000.
8 Drawback. Barack Obama's wife ≈ Barack Obama's spouse, but Barack Obama's wife ≠ Barack Obama's advisors. Seattle motels ≈ Seattle hotels, but Seattle motels ≠ Seattle attractions. Yet all one-hot word vectors are orthogonal and equidistant. Goal: word vectors with a natural notion of similarity, $\langle v_{\text{hotel}}, v_{\text{motel}} \rangle > \langle v_{\text{hotel}}, v_{\text{spouse}} \rangle$.
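The orthogonality claim can be checked directly. The following is a minimal sketch in plain Python using the five-word vocabulary from the one-hot slide; the helper names `one_hot`, `dot`, and `dist` are illustrative and not part of the lecture.

```python
import math

vocab = ["hotel", "motel", "walk", "wife", "spouse"]

def one_hot(word):
    """Return the one-hot vector of `word` over `vocab`."""
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

hotel, motel, spouse = one_hot("hotel"), one_hot("motel"), one_hot("spouse")

# Every pair of distinct words has dot product 0 and the same distance,
# so "hotel" is no closer to "motel" than it is to "spouse".
print(dot(hotel, motel), dot(hotel, spouse))
print(dist(hotel, motel), dist(hotel, spouse))
```

Any similarity measure built on these vectors treats all distinct word pairs alike, which is the motivation for the dense vectors introduced next.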
9 Distributional similarity. "You shall know a word by the company it keeps" (Firth, 1957). Example contexts: "... cashed a check at the bank across the street ..."; "... that bank holds the mortgage on my home ..."; "... said that the bank raised his forecast for ..."; "... employees of the bank have confessed to the charges ...". Central idea: represent words by their context.
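10 Idea 1: represent each word by its context counts.
word   | context
wife   | {met: 3, married: 4, children: 2, wedded: 1, ...}
spouse | {met: 2, married: 5, children: 2, kids: 1, ...}
Problem: synonymous context words are counted as unrelated dimensions: married <==> wedded, children <==> kids.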
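To make the problem concrete, here is a small sketch that compares the two count vectors with cosine similarity. The counts are copied from the slide with the elided entries left out; the function name `cosine` is an illustrative choice.

```python
import math

# Context counts copied from the slide (the elided entries are left out).
wife = {"met": 3, "married": 4, "children": 2, "wedded": 1}
spouse = {"met": 2, "married": 5, "children": 2, "kids": 1}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    shared = set(a) & set(b)
    num = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return num / (norm_a * norm_b)

sim = cosine(wife, spouse)
print(round(sim, 3))
```

Only the shared keys contribute to the numerator, so "wedded" and "kids" add nothing to the similarity even though they are near-synonyms of "married" and "children".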
11 Distributed representations. Represent words and contexts as low-dimensional vectors, e.g. language = a dense real-valued vector.
12 Word vectors
13 Supervised learning. Input: $\{(x_i, y_i)\}_{i=1}^{N}$, $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$. Output (probabilistic model): $f : \mathcal{X} \to \mathcal{Y}$, $f(x) = \arg\max_y p(y \mid x)$. Example: train a spam detector from spam and non-spam e-mails. (Intro to ML prerequisite)
14 Word embeddings. Example text: "that bank holds the mortgage on my home". Step 1: define a supervised learning task from raw text (no manual annotation!): (1) (x, y) = (bank, that); (2) (x, y) = (bank, holds); (3) (x, y) = (holds, bank); (4) (x, y) = (holds, the); (5) (x, y) = (the, holds); (6) (x, y) = (the, mortgage); (7) (x, y) = (mortgage, the); (8) (x, y) = (mortgage, on); (9) (x, y) = (on, mortgage); (10) (x, y) = (on, my); (11) (x, y) = (my, on); (12) (x, y) = (my, home).
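The pair construction can be sketched in a few lines of Python. This assumes a symmetric window of one word on each side, which matches the pairs listed; the function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every position in `tokens`."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue  # skip the center itself and positions off the ends
            pairs.append((center, tokens[t + j]))
    return pairs

sentence = "that bank holds the mortgage on my home".split()
pairs = skipgram_pairs(sentence, window=1)
for x, y in pairs:
    print(x, y)
```

Unlike the list on the slide, this sketch also emits the two sentence-edge pairs (that, bank) and (home, my).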
15 Word embeddings. Step 2: define a model for the output given the input, e.g. $p(\text{holds} \mid \text{bank})$: $p_\theta(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$, where $u$ is the vector for an outside word, $v$ the vector for a center word, $V$ the number of words in the vocabulary, and $\theta$ all parameters. This is a multi-class classification model (number of classes?). Number of parameters in the model: $2 \cdot V \cdot d$, with $u, v \in \mathbb{R}^d$. (Intro to ML prerequisite)
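A minimal sketch of this softmax model in plain Python follows. The tiny vocabulary, the dimension `d = 4`, and the random initial vectors are assumptions made for illustration only; nothing here is trained.

```python
import math
import random

random.seed(0)
vocab = ["that", "bank", "holds", "the", "mortgage"]
d = 4  # embedding dimension (illustrative)

# Two vectors per word: v (center) and u (outside), 2 * V * d parameters in total.
v = {w: [random.uniform(-0.5, 0.5) for _ in range(d)] for w in vocab}
u = {w: [random.uniform(-0.5, 0.5) for _ in range(d)] for w in vocab}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def p_outside_given_center(center):
    """Softmax over the vocabulary: p(o | c) for every outside word o."""
    scores = {o: dot(u[o], v[center]) for o in vocab}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {o: math.exp(s - m) for o, s in scores.items()}
    z = sum(exps.values())  # partition function
    return {o: e / z for o, e in exps.items()}

probs = p_outside_given_center("bank")
n_params = 2 * len(vocab) * d
print(probs["holds"], n_params)
```

The sum over the whole vocabulary in `z` is the partition function whose cost motivates negative sampling later in the lecture.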
16 Word embeddings. Step 3: define an objective function for a corpus of length $T$: $L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} p_\theta(w_{t+j} \mid w_t)$, $J(\theta) = \log L(\theta) = \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log p_\theta(w_{t+j} \mid w_t)$. Find the parameters that maximize the objective. (Intro to ML prerequisite)
17 Word embeddings. Intuitions: What probabilities would maximize the objective? Why should similar words have similar vectors? Why do we have different parameters for the center word and the output word? Example with counts $c(x, y) = 2$, $c(x, z) = 1$: $J(\theta) = p(y \mid x)^2\, p(z \mid x) = p(y \mid x)^2 (1 - p(y \mid x))$, $\nabla J(\theta) = 2 p(y \mid x) - 3 p(y \mid x)^2 = p(y \mid x)(2 - 3 p(y \mid x))$, so $p(y \mid x) = \frac{2}{3}$, $p(z \mid x) = \frac{1}{3}$. (Intro to ML prerequisite)
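The count example can be verified numerically with a simple grid search. This is only a sanity check under the same two-outcome assumption as the slide; the names `likelihood` and `grid` are illustrative.

```python
def likelihood(p):
    """J as a function of p = p(y|x), with p(z|x) = 1 - p."""
    return p ** 2 * (1 - p)

# Evaluate J on a fine grid over [0, 1] and keep the best point.
grid = [i / 3000 for i in range(3001)]
best_p = max(grid, key=likelihood)
print(best_p, 1 - best_p)
```

The maximizing probabilities equal the relative frequencies of the observed pairs, which is the intuition the slide is after.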
19 Gradient descent. How do we find the right model parameters? Start at some point and move in the direction opposite to the gradient. (Intro to ML prerequisite)
20 Gradient descent. Example: $f(x) = x^4 + 3x^3 + 2$, $f'(x) = 4x^3 + 9x^2$. (Intro to ML prerequisite)
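A short sketch of gradient descent on this example function is given below. The starting point and step size are arbitrary choices made for illustration, not values from the lecture.

```python
def f(x):
    return x ** 4 + 3 * x ** 3 + 2

def f_prime(x):
    return 4 * x ** 3 + 9 * x ** 2

x = -3.0      # starting point (an arbitrary choice)
alpha = 0.01  # step size (an arbitrary choice)
for _ in range(1000):
    x -= alpha * f_prime(x)  # move against the gradient

print(x, f(x))
```

Since $f'(x) = x^2 (4x + 9)$, the derivative vanishes at $x = -9/4$ (the minimum) and at $x = 0$ (a flat inflection point); from this starting point the iterates settle at the minimum.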
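21 Gradient descent. We want to minimize $J(\theta) = -\sum_{t=1}^{T} \sum_{j} \log p_\theta(w_{t+j} \mid w_t)$. Update rule per parameter: $\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial}{\partial \theta_j^{old}} J(\theta)$; in vector form: $\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$, where $\theta \in \mathbb{R}^{2Vd}$ and $\alpha$ is a step size. (Intro to ML prerequisite)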
22 Stochastic gradient descent. For large corpora (billions of tokens) this update is very slow. Instead, sample a window $t$ and update the parameters using the gradient from that window: $\theta^{new} = \theta^{old} - \alpha \nabla_\theta J_t(\theta)$. (Intro to ML prerequisite)
23 Deriving the gradient. Mostly applications of the chain rule. Let's derive the gradient for a window ($t$) and a center word: $\log p_\theta(w_{t+j} \mid w_t)$. You will do this again in the assignment (and more).
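For reference, the chain-rule computation for the center vector can be written out as follows. This is the standard softmax result, stated with the symbols of the model slide ($u$ for outside vectors, $v_c$ for the center vector); it is not a transcript of the whiteboard derivation.

```latex
\begin{aligned}
\log p_\theta(o \mid c) &= u_o^\top v_c - \log \sum_{w=1}^{V} \exp(u_w^\top v_c) \\
\frac{\partial}{\partial v_c} \log p_\theta(o \mid c)
  &= u_o - \frac{\sum_{w=1}^{V} \exp(u_w^\top v_c)\, u_w}{\sum_{w'=1}^{V} \exp(u_{w'}^\top v_c)} \\
  &= u_o - \sum_{w=1}^{V} p_\theta(w \mid c)\, u_w
\end{aligned}
```

The gradient is the observed outside vector minus the outside vector expected under the current model, so it vanishes when the model's expectation matches the data.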
24 Whiteboard
25 Class 2: recap. Goal: represent words with low-dimensional vectors. Approach: define a supervised learning problem from a corpus. We defined the necessary components for skip-gram: model (softmax over word labels for each word), objective (minimize the negative log-likelihood), and optimization with SGD. We computed the gradient for some parameters by hand.
26 Computational problem. Computing the partition function is too expensive. Solution 1: hierarchical softmax (Morin and Bengio, 2005) reduces computation time to $\log V$ by constructing a binary tree over the vocabulary. Solution 2: change the model: skip-gram with negative sampling (home assignment 1).
27 Logistic regression. Training examples: (x, y) = ((bank, holds), 1); (x, y) = ((bank, table), 0); (x, y) = ((bank, eat), 0); (x, y) = ((holds, bank), 1); (x, y) = ((holds, quickly), 0); (x, y) = ((holds, which), 0); (x, y) = ((the, mortgage), 1); (x, y) = ((the, eat), 0); (x, y) = ((the, who), 0). What information is lost? $\sum_{o \in V} p(y = 1 \mid o, c) = ?$
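28 Logistic regression. Model: $p_\theta(y = 1 \mid c, o) = \frac{1}{1 + \exp(-u_o^\top v_c)} = \sigma(u_o^\top v_c)$, $p_\theta(y = 0 \mid c, o) = 1 - \sigma(u_o^\top v_c) = \sigma(-u_o^\top v_c)$. Objective: $\sum_{t,j} \log \sigma(u_{w_{t+j}}^\top v_{w_t}) + \sum_{k \sim p(w)} \log \sigma(-u_{w^{(k)}}^\top v_{w_t})$, with negative words sampled from $p(w) = U(w)^{3/4} / Z$. (Intro to ML prerequisite)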
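The sketch below evaluates this objective for a single (center, outside) pair with sampled negatives. The unigram counts, the dimension, and the random vectors are hypothetical stand-ins; `sgns_objective` is an illustrative name, and the sketch does not filter out negatives that happen to equal the true outside word.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical unigram counts U(w); real counts would come from the corpus.
counts = {"bank": 10, "holds": 4, "the": 50, "table": 3, "eat": 2}
weights = {w: c ** 0.75 for w, c in counts.items()}
Z = sum(weights.values())
p_neg = {w: x / Z for w, x in weights.items()}  # p(w) = U(w)^{3/4} / Z

d = 4
v = {w: [random.uniform(-0.5, 0.5) for _ in range(d)] for w in counts}  # center
u = {w: [random.uniform(-0.5, 0.5) for _ in range(d)] for w in counts}  # outside

def sgns_objective(center, outside, k=2):
    """log sigma(u_o . v_c) plus k negative terms log sigma(-u_w . v_c)."""
    words = list(p_neg)
    negatives = random.choices(words, weights=[p_neg[w] for w in words], k=k)
    value = math.log(sigmoid(dot(u[outside], v[center])))
    for w in negatives:
        value += math.log(sigmoid(-dot(u[w], v[center])))
    return value

obj = sgns_objective("bank", "holds")
print(obj)
```

Raising counts to the power 3/4 flattens the distribution: the most frequent word is sampled less often than its raw frequency would suggest, and rare words more often.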
29 Summary. We defined the three necessary components: model (binary classification), objective (maximum likelihood with negative sampling), and optimization method (SGD).
30 Many variants. CBOW: predict the center word from the context. Defining context: How big is the window? Is it sequential or based on syntactic information? A different model for every context position? Use stop words?
31 Matrix factorization
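32 Matrix factorization (Landauer and Dumais, 1997). Consider the word-context co-occurrence matrix for a corpus (counts within a window of one word): "I like deep learning. I like NLP. I enjoy flying."
counts   | I | like | enjoy | deep | learning | NLP | flying | .
I        | 0 | 2    | 1     | 0    | 0        | 0   | 0      | 0
like     | 2 | 0    | 0     | 1    | 0        | 1   | 0      | 0
enjoy    | 1 | 0    | 0     | 0    | 0        | 0   | 1      | 0
deep     | 0 | 1    | 0     | 0    | 1        | 0   | 0      | 0
learning | 0 | 0    | 0     | 1    | 0        | 0   | 0      | 1
NLP      | 0 | 1    | 0     | 0    | 0        | 0   | 0      | 1
flying   | 0 | 0    | 1     | 0    | 0        | 0   | 0      | 1
.        | 0 | 0    | 0     | 0    | 1        | 1   | 1      | 0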
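The counts for this toy corpus can be reproduced with a few lines of Python. The sketch assumes a symmetric window of one word, no counting across sentence boundaries, and the final period treated as its own token; `cooccurrence` is an illustrative name.

```python
from collections import defaultdict

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]

def cooccurrence(sentences, window=1):
    """Count how often each word pair appears within `window` positions."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence(corpus)
print(counts[("I", "like")], counts[("I", "enjoy")], counts[("like", "deep")])
```

Because the window is symmetric, the resulting matrix is symmetric as well.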
33 Matrix factorization. Reconstruct the matrix from low-dimensional word-context representations. Minimizes: $\sum_{i,j} \left(A_{ij} - (\hat{A}_k)_{ij}\right)^2 = \|A - \hat{A}_k\|^2$.
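One standard way to obtain such a low-rank reconstruction is a truncated singular value decomposition, sketched here with NumPy. The matrix is the one-word-window co-occurrence matrix of the three example sentences; the rank `k = 2` and the choice of `U[:, :k] * s[:k]` as word vectors are illustrative assumptions.

```python
import numpy as np

# Co-occurrence matrix for the toy corpus (rows/columns: I, like, enjoy,
# deep, learning, NLP, flying, "."), counted with a window of one word.
A = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k reconstruction
word_vectors = U[:, :k] * s[:k]               # one k-dimensional vector per word

error = np.sum((A - A_k) ** 2)                # sum of squared entry errors
print(error, np.sum(s[k:] ** 2))
```

The two printed numbers agree: the squared reconstruction error of the truncated SVD equals the sum of the squared discarded singular values, and no other rank-k matrix achieves a smaller error.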
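34 Matrix factorization
35 Relation to skip-gram (Levy and Goldberg, 2015). The output of skip-gram can be viewed as factorizing a word-context matrix: $M = V U^\top$, with $M \in \mathbb{R}^{|V| \times |V|}$ and $V, U \in \mathbb{R}^{|V| \times d}$ (here $V$ and $U$ are the matrices of center and outside vectors, and $|V|$ is the vocabulary size). What should the values of $M$ be? $M_{co} = \langle v_c, u_o \rangle$.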
36-37 Relation to skip-gram. Re-write the objective, where $\#(\cdot)$ denotes a corpus count, $k$ the number of negative samples, and negatives $o'$ are drawn with probability $\#(o') / T$:
$L(\theta) = \sum_{c,o} \#(c,o) \left( \log \sigma(u_o^\top v_c) + k \cdot \mathbb{E}_{o'}[\log \sigma(-u_{o'}^\top v_c)] \right)$
$= \sum_{c,o} \#(c,o) \log \sigma(u_o^\top v_c) + \sum_{c,o} \#(c,o) \cdot k \cdot \mathbb{E}_{o'}[\log \sigma(-u_{o'}^\top v_c)]$
$= \sum_{c,o} \#(c,o) \log \sigma(u_o^\top v_c) + \sum_{c} \#(c) \cdot k \cdot \mathbb{E}_{o'}[\log \sigma(-u_{o'}^\top v_c)]$
$= \sum_{c,o} \#(c,o) \log \sigma(u_o^\top v_c) + \sum_{c} \#(c) \cdot k \sum_{o} \frac{\#(o)}{T} \log \sigma(-u_o^\top v_c)$
$= \sum_{c,o} \left( \#(c,o) \log \sigma(u_o^\top v_c) + \#(c) \cdot k \cdot \frac{\#(o)}{T} \log \sigma(-u_o^\top v_c) \right)$
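38 Relation to skip-gram. Let's assume the dot products are independent of one another. Let $x = u_o^\top v_c$, so that $L(\theta) = \sum_{c,o} \ell(x)$ with $\ell(x) = \#(c,o) \log \sigma(x) + \#(c) \cdot k \cdot \frac{\#(o)}{T} \log \sigma(-x)$. Setting the derivative to zero: $\frac{\partial \ell}{\partial x} = \#(c,o)\, \sigma(-x) - \#(c) \cdot k \cdot \frac{\#(o)}{T}\, \sigma(x) = 0$, which gives $x = \log\left( \frac{\#(c,o) \cdot T}{\#(c) \cdot \#(o)} \cdot \frac{1}{k} \right) = \log \frac{p(c,o)}{p(c)\, p(o)} - \log k = \mathrm{PMI}(c,o) - \log k$.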
39 Relation to skip-gram. Conclusion: skip-gram with negative sampling implicitly factorizes a shifted PMI matrix. Many NLP methods factorize the PMI matrix with matrix decomposition methods to obtain dense vectors.
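The shifted PMI values themselves are easy to compute from pair counts, as the following sketch shows. The pair counts are hypothetical, `shifted_pmi` is an illustrative name, and unobserved pairs (whose PMI is negative infinity) are simply left out.

```python
import math

def shifted_pmi(counts, k=1):
    """Return {(c, o): PMI(c, o) - log k} for every observed pair."""
    total = sum(counts.values())                     # T
    c_tot, o_tot = {}, {}
    for (c, o), n in counts.items():
        c_tot[c] = c_tot.get(c, 0) + n               # #(c)
        o_tot[o] = o_tot.get(o, 0) + n               # #(o)
    return {
        (c, o): math.log(n * total / (c_tot[c] * o_tot[o])) - math.log(k)
        for (c, o), n in counts.items() if n > 0
    }

# Hypothetical pair counts #(c, o), for illustration only.
toy_counts = {("bank", "money"): 2, ("bank", "river"): 1,
              ("shore", "money"): 1, ("shore", "river"): 2}
m = shifted_pmi(toy_counts, k=2)
print(m[("bank", "money")])
```

Factorizing a matrix filled with these values (for example with a truncated SVD) is the explicit counterpart of what skip-gram with negative sampling does implicitly.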
40 Evaluation
41 Evaluation. Intrinsic vs. extrinsic evaluation. Intrinsic: define some artificial task that tries to directly measure the quality of your learning algorithm. Extrinsic: check whether your output is useful in a real NLP task.
42 Intrinsic evaluation. Word analogies: normalize all word vectors to unit length. man::woman <==> king::??; in general a::b <==> c::d, with $d = \arg\max_i \frac{(x_b - x_a + x_c)^\top x_i}{\|x_b - x_a + x_c\|}$. Does cosine distance capture semantic and syntactic intuitions?
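The analogy rule can be sketched as follows. The three-dimensional vectors are hypothetical and chosen by hand so that the analogy works; excluding the three query words from the candidates is a common convention assumed here, not something stated on the slide.

```python
import math

# Hypothetical 3-dimensional vectors, chosen by hand so the analogy works.
raw = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 1.0, 0.0],
}

def normalize(x):
    n = math.sqrt(sum(a * a for a in x))
    return [a / n for a in x]

vectors = {w: normalize(x) for w, x in raw.items()}

def analogy(a, b, c):
    """Return d maximizing cosine(x_b - x_a + x_c, x_d), excluding a, b, c."""
    target = normalize([xb - xa + xc for xa, xb, xc in
                        zip(vectors[a], vectors[b], vectors[c])])
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: sum(t * x for t, x in zip(target, vectors[w])))

print(analogy("man", "woman", "king"))
```

With trained embeddings the same function would be run over the full vocabulary, and accuracy over a list of analogy questions gives the intrinsic score.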
43-45 Visualization (figures)
46 Word analogies evaluation
47 Human correlation intrinsic evaluation
word 1   | word 2   | human judgement
tiger    | cat      | 7.35
book     | paper    | 7.46
computer | internet | 7.58
plane    | car      | 5.77
stock    | phone    | 1.62
stock    | CD       | 1.31
stock    | jaguar   |
48 Human correlation intrinsic evaluation. Compute the Spearman rank correlation between human similarity judgements and model similarity predictions (WordSim-353).
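A minimal sketch of the Spearman computation is shown below. The human scores are the six values that survive in the table above; the model similarities are invented for illustration, and the rank formula used assumes there are no ties.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Human judgements from the slide for the six pairs with a surviving score.
human = [7.35, 7.46, 7.58, 5.77, 1.62, 1.31]
# Hypothetical model cosine similarities for the same pairs, in the same order.
model = [0.52, 0.48, 0.60, 0.41, 0.12, 0.15]

rho = spearman(human, model)
print(rho)
```

Only the ordering of the scores matters, so the model's similarities do not need to be on the same 0 to 10 scale as the human judgements.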
49 Extrinsic evaluation. Task: named entity recognition, i.e. find mentions of persons, locations, and organizations in text. Good word representations might be useful.
50 Extrinsic evaluation
51 Summary. Words are central to language. Most NLP systems use some word representation. Graph-based representations are difficult to manipulate and compose. One-hot vectors are useful with enough data but lose all generalization information. Word embeddings provide a compact way to encode word meaning and similarity. Skip-gram with negative sampling is a popular approach for learning word embeddings by casting an unsupervised problem as a supervised problem. It is related to classical matrix decomposition methods.
52 Assignment 1. Implement skip-gram with negative sampling. There is ample literature if you want to consider this for a project.
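53 Gradient check: $\frac{\partial J(\theta)}{\partial \theta} = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$. Compute this for every parameter with a small $\epsilon$.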
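The check can be sketched for the softmax model as follows: the analytic gradient of $\log p(o \mid c)$ with respect to the center vector is compared against the central difference. The toy parameters and all function names are hypothetical.

```python
import math

def log_softmax_prob(v_c, U, o):
    """log p(o | c) for center vector v_c and outside vectors U (list of rows)."""
    scores = [sum(a * b for a, b in zip(u, v_c)) for u in U]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[o] - log_z

def analytic_grad(v_c, U, o):
    """u_o minus the expected outside vector under p(. | c)."""
    probs = [math.exp(log_softmax_prob(v_c, U, w)) for w in range(len(U))]
    return [U[o][i] - sum(p * u[i] for p, u in zip(probs, U))
            for i in range(len(v_c))]

def numeric_grad(f, theta, eps=1e-5):
    """Central difference (f(theta + eps) - f(theta - eps)) / (2 eps), per coordinate."""
    grad = []
    for i in range(len(theta)):
        plus = theta[:i] + [theta[i] + eps] + theta[i + 1:]
        minus = theta[:i] + [theta[i] - eps] + theta[i + 1:]
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad

# Hypothetical toy parameters: three outside vectors and one center vector.
U = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
v_c = [0.3, -0.1]
g_analytic = analytic_grad(v_c, U, o=1)
g_numeric = numeric_grad(lambda v: log_softmax_prob(v, U, 1), v_c)
print(g_analytic, g_numeric)
```

If the two gradients disagree beyond a small tolerance, the hand-derived gradient (or its implementation) is likely wrong.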
More informationMachine Learning. Neural Networks. (slides from Domingos, Pardo, others)
Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward
More informationLecture 12: Algorithms for HMMs
Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 17 October 2016 updated 9 September 2017 Recap: tagging POS tagging is a
More informationRecurrent Neural Network
Recurrent Neural Network Xiaogang Wang xgwang@ee..edu.hk March 2, 2017 Xiaogang Wang (linux) Recurrent Neural Network March 2, 2017 1 / 48 Outline 1 Recurrent neural networks Recurrent neural networks
More informationClassification Based on Probability
Logistic Regression These slides were assembled by Byron Boots, with only minor modifications from Eric Eaton s slides and grateful acknowledgement to the many others who made their course materials freely
More informationLoss Functions and Optimization. Lecture 3-1
Lecture 3: Loss Functions and Optimization Lecture 3-1 Administrative: Live Questions We ll use Zoom to take questions from remote students live-streaming the lecture Check Piazza for instructions and
More informationComments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms
Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:
More informationCollaborative Filtering. Radek Pelánek
Collaborative Filtering Radek Pelánek 2017 Notes on Lecture the most technical lecture of the course includes some scary looking math, but typically with intuitive interpretation use of standard machine
More informationApprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning
Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 6: Numerical Linear Algebra: Applications in Machine Learning Cho-Jui Hsieh UC Davis April 27, 2017 Principal Component Analysis Principal
More informationLogistic Regression & Neural Networks
Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability
More informationarxiv: v2 [cs.cl] 1 Jan 2019
Variational Self-attention Model for Sentence Representation arxiv:1812.11559v2 [cs.cl] 1 Jan 2019 Qiang Zhang 1, Shangsong Liang 2, Emine Yilmaz 1 1 University College London, London, United Kingdom 2
More informationLecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron)
Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron) Intro to NLP, CS585, Fall 2014 http://people.cs.umass.edu/~brenocon/inlp2014/ Brendan O Connor (http://brenocon.com) 1 Models for
More informationWarm up: risk prediction with logistic regression
Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T
More informationFoundations of Machine Learning
Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about
More information