ATASS: Word Embeddings


Lee Gao, April 22, 2016

Guideline
- Bag-of-words (bag-of-n-grams): high-dimensional, sparse representation; dimension reductions (LSA, LDA, MNIR)
- Today: neural networks (backpropagation algorithm, convolutional neural networks, recurrent neural networks), word embeddings (continuous bag-of-words, skip-gram), and downstream predictions

Neural Networks. For more information, see A Tutorial on Deep Learning, Quoc V. Le (https://cs.stanford.edu/~quocle).

Example: Should I watch the movie Gravity? Both Mary and John rated it 3/5. Historical ratings: (O: I like the movie; X: I do not like the movie).

Decision Function
Features: $x_1$: Mary's rating, $x_2$: John's rating.
Decision function: $h(x;\theta,b) = g(\theta^T x + b)$, where $g(z) = \frac{1}{1 + \exp(-z)}$.
Objective: $\min_{\theta,b} \sum_{i=1}^{m} \left[ h(x^{(i)};\theta,b) - y^{(i)} \right]^2$.
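As a concrete illustration, here is a minimal numpy sketch of this decision function and squared-error objective; the toy ratings and labels below are made up for illustration, not taken from the slides.

```python
import numpy as np

def g(z):
    # logistic (sigmoid) activation
    return 1.0 / (1.0 + np.exp(-z))

def h(X, theta, b):
    # decision function h(x; theta, b) = g(theta^T x + b), row-wise over X
    return g(X @ theta + b)

def objective(X, y, theta, b):
    # sum of squared errors over the m training examples
    return np.sum((h(X, theta, b) - y) ** 2)

# toy data: x1 = Mary's rating, x2 = John's rating (illustrative values)
X = np.array([[3.0, 3.0], [1.0, 5.0], [4.0, 4.0]])
y = np.array([1.0, 0.0, 1.0])
theta, b = np.zeros(2), 0.0
print(objective(X, y, theta, b))
```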

Neural network illustration
Learning: stochastic gradient descent. $\alpha$ is the learning rate; a large $\alpha$ will give aggressive updates, a small $\alpha$ will give conservative updates.
$\theta_1 = \theta_1 - \alpha \frac{\partial}{\partial \theta_1}$, $\quad \theta_2 = \theta_2 - \alpha \frac{\partial}{\partial \theta_2}$, $\quad b = b - \alpha \frac{\partial}{\partial b}$,
where the partial derivative (at example $i$) for $\theta_1$ is
$\frac{\partial}{\partial \theta_1} \left( h(x^{(i)};\theta,b) - y^{(i)} \right)^2 = 2 \left[ g(\theta^T x^{(i)} + b) - y^{(i)} \right] \left[ 1 - g(\theta^T x^{(i)} + b) \right] g(\theta^T x^{(i)} + b)\, x_1^{(i)}$,
and similarly for $\theta_2$ and $b$.
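A sketch of a single stochastic gradient update for this unit, following the partial derivative above; the learning rate and the toy example are placeholders.

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(x_i, y_i, theta, b, alpha=0.1):
    # forward pass for one example
    p = g(theta @ x_i + b)
    # common factor 2 * (h - y) * g(z) * (1 - g(z)) from the chain rule
    delta = 2.0 * (p - y_i) * p * (1.0 - p)
    # gradient descent updates for theta_1, theta_2 and b
    theta = theta - alpha * delta * x_i
    b = b - alpha * delta
    return theta, b

theta, b = np.zeros(2), 0.0
theta, b = sgd_step(np.array([3.0, 3.0]), 1.0, theta, b)
print(theta, b)
```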

The limitations of a linear decision function: the samples are not linearly separable. Problem decomposition: two simpler problems that can each be solved using linear models.

Neural network illustration
Suppose the two decision functions are $h_1(x;(\theta_1,\theta_2),b_1)$ and $h_2(x;(\theta_3,\theta_4),b_2)$. Objective:
$\min_{w,c} \sum_{i=1}^{m} \left[ h\left( \left( h_1(x^{(i)}), h_2(x^{(i)}) \right); w, c \right) - y^{(i)} \right]^2$
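A minimal sketch of this composition: two linear-sigmoid units whose outputs feed a third unit. The particular weights below are arbitrary placeholders, only meant to show the wiring.

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_unit_network(x, theta1, b1, theta2, b2, w, c):
    # hidden units h1(x; (theta_1, theta_2), b_1) and h2(x; (theta_3, theta_4), b_2)
    h1 = g(theta1 @ x + b1)
    h2 = g(theta2 @ x + b2)
    # output unit combines them: h((h1, h2); w, c)
    return g(w @ np.array([h1, h2]) + c)

x = np.array([3.0, 3.0])   # Mary's and John's ratings
print(two_unit_network(x, np.array([1.0, -1.0]), 0.0,
                       np.array([-1.0, 1.0]), 0.0,
                       np.array([1.0, 1.0]), -1.0))
```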

The backpropagation algorithm
Goal: compute the parameter gradients. An implementation of the chain rule specifically designed for neural networks.
Generalized parameters: $\theta$ for weights, $b$ for biases; layers indexed by $1$ (input), $2, \ldots, L$ (output).
$\theta^{(l)}_{ij}$: weight connecting neuron $i$ in layer $l$ to neuron $j$ in layer $l+1$. $b^{(l)}_i$: bias of neuron $i$ in layer $l$.
Decision functions:
$h^{(1)} = x$
$h^{(2)} = g\left( (\theta^{(1)})^T h^{(1)} + b^{(1)} \right)$
$\vdots$
$h^{(L-1)} = g\left( (\theta^{(L-2)})^T h^{(L-2)} + b^{(L-2)} \right)$
$h(x) = h^{(L)} = g\left( (\theta^{(L-1)})^T h^{(L-1)} + b^{(L-1)} \right)$

The backpropagation algorithm
1. Perform a feed-forward pass to compute $h^{(1)}, h^{(2)}, \ldots, h^{(L)}$.
2. For the output layer, compute
$\delta^{(L)} = 2\left( h^{(L)} - y \right) g'\left( \sum_{i=1}^{s_{L-1}} \theta^{(L-1)}_{i1} h^{(L-1)}_i + b^{(L-1)}_1 \right)$,
where $s_l$ is the number of neurons in layer $l$.
3. Perform a backward pass: for $l = L-1, L-2, \ldots, 2$ and each node $j$ in layer $l$, compute
$\delta^{(l)}_j = \left( \sum_{k=1}^{s_{l+1}} \theta^{(l)}_{jk} \delta^{(l+1)}_k \right) g'\left( \sum_{i=1}^{s_{l-1}} \theta^{(l-1)}_{ij} h^{(l-1)}_i + b^{(l-1)}_j \right)$.
4. The desired partial derivatives can be computed as
$\frac{\partial}{\partial \theta^{(l)}_{ij}} = h^{(l)}_i \delta^{(l+1)}_j$, $\quad \frac{\partial}{\partial b^{(l)}_i} = \delta^{(l+1)}_i$.
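Below is a hedged numpy sketch of these four steps for a small fully connected network with sigmoid activations and squared error; the layer sizes and random weights are illustrative only.

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, thetas, biases):
    """Gradients of the squared error for a single example.

    thetas[l] has shape (s_l, s_{l+1}) so that h[l+1] = g(thetas[l].T @ h[l] + biases[l]),
    matching the theta^(l)_{ij} convention on the slide.
    """
    # 1. feed-forward pass, storing every h^(l)
    hs = [x]
    for theta, b in zip(thetas, biases):
        hs.append(g(theta.T @ hs[-1] + b))
    # 2. output-layer error; g'(z) = g(z) * (1 - g(z)), so we can reuse h^(L)
    delta = 2.0 * (hs[-1] - y) * hs[-1] * (1.0 - hs[-1])
    grads_theta, grads_b = [], []
    # 3. backward pass over the layers
    for l in reversed(range(len(thetas))):
        # 4. dJ/dtheta^(l)_{ij} = h^(l)_i * delta^(l+1)_j ;  dJ/db^(l)_i = delta^(l+1)_i
        grads_theta.insert(0, np.outer(hs[l], delta))
        grads_b.insert(0, delta)
        if l > 0:
            delta = (thetas[l] @ delta) * hs[l] * (1.0 - hs[l])
    return grads_theta, grads_b

# tiny 2-4-1 network with random weights (illustrative only)
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(2, 4)), rng.normal(size=(4, 1))]
biases = [np.zeros(4), np.zeros(1)]
gt, gb = backprop(np.array([3.0, 3.0]), np.array([1.0]), thetas, biases)
print([t.shape for t in gt])
```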

Deep vs. shallow networks. Deep networks are more computationally attractive than shallow networks: they require far fewer connections.

Convolutional neural networks. In the networks seen so far, every neuron in the first hidden layer connects to all of the input neurons; this does not work when $x$ is high-dimensional. Convolutional neural network (CNN): locally connected neural networks with weight sharing, e.g. $w_1 = w_4 = w_7$, $w_2 = w_5 = w_8$, $w_3 = w_6 = w_9$.
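A tiny numpy sketch of the locally connected, weight-sharing idea: a 1-D convolution with a window of three inputs, where every position reuses the same three weights (the filter values are placeholders).

```python
import numpy as np

def conv1d_layer(x, w, b):
    # each hidden unit sees only 3 neighbouring inputs, and all units
    # share the same weights (w1 = w4 = w7, w2 = w5 = w8, w3 = w6 = w9)
    return np.array([w @ x[i:i + 3] + b for i in range(len(x) - 2)])

x = np.arange(9, dtype=float)       # 9-dimensional input
w = np.array([0.2, 0.5, 0.3])       # one shared 3-weight filter
print(conv1d_layer(x, w, b=0.0))    # 7 locally connected outputs
```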

Recurrent neural networks
$x_0, x_1, \ldots, x_T$ are the inputs (e.g., a sequence of words); $h_0, h_1, \ldots, h_T$ are the hidden states of the recurrent network.
Three sets of parameters: input-to-hidden weights $W$, hidden-to-hidden weights $U$, hidden-to-output weights $V$.
Data generating process:
$f(x) = V h_T$
$h_t = \sigma(U h_{t-1} + W x_t)$, for $t = T, \ldots, 1$
$h_0 = \sigma(W x_0)$
Objective: minimize $(y - f(x))^2$.
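A minimal forward-pass sketch of this recurrence in numpy; the dimensions and random parameters are placeholders.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W, U, V):
    # h_0 = sigma(W x_0); h_t = sigma(U h_{t-1} + W x_t); f(x) = V h_T
    h = sigma(W @ xs[0])
    for x_t in xs[1:]:
        h = sigma(U @ h + W @ x_t)
    return V @ h

# toy dimensions: 4-dimensional inputs, 3-dimensional hidden state, scalar output
rng = np.random.default_rng(0)
W, U, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(1, 3))
xs = [rng.normal(size=4) for _ in range(5)]   # x_0, ..., x_T as random vectors
print(rnn_forward(xs, W, U, V))
```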

Word Embeddings

Word embeddings
Weaknesses of the bag-of-words model:
- Word order information is lost: "AlphaGo beats Lee" / "Lee beats AlphaGo" -> [AlphaGo, beats, Lee].
- Semantic information is lost: it cannot distinguish the difference between "stock" and "returns" from the difference between "stock" and "Africa".
- High dimensionality.
Word embeddings map words into a low-dimensional space (relative to vocabulary size), learned with neural networks: continuous bag-of-words (CBOW) and skip-gram.

Continuous bag-of-words
Idea: find word vector representations that are useful for predicting a certain word using the surrounding words in a sentence or a document.
Embedding vectors: word $v_w \in \mathbb{R}^r$ for word $w \in V_W$; context $v_c \in \mathbb{R}^r$ for context $c \in V_C$. $r$ is the embedding dimensionality, a hyperparameter.
For a context $c = (w_{t-l}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+l})$, the context vector is $v_c = \frac{1}{2l} \sum_{i=1}^{l} \left( v_{w_{t-i}} + v_{w_{t+i}} \right)$, and the probability for word $w$ to appear in context $c$ is
$p(w \mid c) = \sigma(v_w \cdot v_c) = \frac{1}{1 + \exp(-v_w \cdot v_c)}$.
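A small numpy sketch of this probability: average the surrounding word vectors to get $v_c$, then take the sigmoid of the dot product. The vectors here are random placeholders.

```python
import numpy as np

def p_word_given_context(v_w, context_vecs):
    # v_c is the average of the 2l surrounding word vectors
    v_c = np.mean(context_vecs, axis=0)
    # p(w | c) = sigma(v_w . v_c)
    return 1.0 / (1.0 + np.exp(-v_w @ v_c))

r = 8                                  # embedding dimensionality (hyperparameter)
rng = np.random.default_rng(0)
v_w = rng.normal(size=r)               # vector of the centre word
context = rng.normal(size=(4, r))      # vectors of the 2l = 4 surrounding words
print(p_word_given_context(v_w, context))
```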

CBOW objective
Negative sampling: choose $v_w$ ($v_c$ is a deterministic function of the $v_w$) to maximize
$\log \sigma(v_w \cdot v_c) + k\, \mathbb{E}_{w_N \sim P(w)} \log \sigma(-v_{w_N} \cdot v_c)$,
where $k$ is a hyperparameter controlling the penalization of $(w,c)$ pairs not appearing in the corpus. Negative words $w_N$ are drawn according to the empirical distribution $P(w) = \frac{\#(w)}{|D|}$, where $D$ is the set of $(w,c)$ pairs.
Global objective:
$L = \sum_{w \in V_W} \sum_{c \in V_C} \#(w,c) \left[ \log \sigma(v_w \cdot v_c) + k\, \mathbb{E}_{w_N \sim P(w)} \log \sigma(-v_{w_N} \cdot v_c) \right]$.
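A sketch of the per-pair negative-sampling term in numpy, with the expectation approximated by drawing $k$ negative words from the empirical distribution $P(w)$; the toy counts and vectors are placeholders.

```python
import numpy as np

def log_sigma(z):
    # plain log-sigmoid, kept simple for the sketch
    return np.log(1.0 / (1.0 + np.exp(-z)))

def sgns_pair_objective(v_w, v_c, neg_vecs):
    # log sigma(v_w . v_c) + sum over the k sampled negatives of log sigma(-v_wn . v_c)
    return log_sigma(v_w @ v_c) + sum(log_sigma(-v_n @ v_c) for v_n in neg_vecs)

rng = np.random.default_rng(0)
r, k = 8, 5                                  # embedding size and number of negatives
v_w, v_c = rng.normal(size=r), rng.normal(size=r)
counts = np.array([50.0, 30.0, 15.0, 5.0])   # toy word counts #(w)
P = counts / counts.sum()                    # empirical distribution P(w) = #(w)/|D|
word_vecs = rng.normal(size=(4, r))
negatives = word_vecs[rng.choice(4, size=k, p=P)]
print(sgns_pair_objective(v_w, v_c, negatives))
```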

CBOW-Doc architecture
Idea: not only each individual word, but also each document is represented by a dense vector, which is trained to predict the words in the document; this provides a direct way to embed a document into a vector space.
The document embedding vector $v_d \in \mathbb{R}^r$ is learned directly from the neural network model. The probability for word $w$ to appear in context $c$ and document $d$ is
$p(w \mid c, d) = \sigma\left( v_w \cdot \left( \alpha v_c + (1-\alpha) v_d \right) \right)$,
where $\alpha \in [0,1]$ is the weight assigned to the context vector $v_c$.
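A one-function sketch of this mixture; $\alpha$ and the vectors below are placeholders.

```python
import numpy as np

def p_word_given_context_and_doc(v_w, v_c, v_d, alpha=0.5):
    # p(w | c, d) = sigma(v_w . (alpha * v_c + (1 - alpha) * v_d))
    mix = alpha * v_c + (1.0 - alpha) * v_d
    return 1.0 / (1.0 + np.exp(-v_w @ mix))

rng = np.random.default_rng(0)
v_w, v_c, v_d = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
print(p_word_given_context_and_doc(v_w, v_c, v_d, alpha=0.7))
```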

Skip-gram
Idea: find word vector representations that are useful for predicting the surrounding words given a certain word in a sentence or a document. The probability for context $c = (w_{t-l}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+l})$ to appear around word $w$ is
$p(c \mid w) = \sigma(v_c \cdot v_w) = \frac{1}{1 + \exp(-v_c \cdot v_w)}$.

Skip-gram objective
Negative sampling: choose $v_w$ to maximize
$\log \sigma(v_c \cdot v_w) + k\, \mathbb{E}_{c_N \sim P(c)} \log \sigma(-v_{c_N} \cdot v_w)$,
where $k$ is a hyperparameter controlling the penalization of $(w,c)$ pairs not appearing in the corpus. Negative contexts $c_N$ are drawn according to the empirical distribution $P(c) = \frac{\#(c)}{|D|}$, where $D$ is the set of $(c,w)$ pairs.
Global objective:
$L = \sum_{w \in V_W} \sum_{c \in V_C} \#(w,c) \left[ \log \sigma(v_w \cdot v_c) + k\, \mathbb{E}_{c_N \sim P(c)} \log \sigma(-v_{c_N} \cdot v_w) \right]$.

Matrix factorization
The word-context matrix is a $|V_W| \times |V_C|$ matrix $M$: each row corresponds to a word $w \in V_W$, each column corresponds to a context $c \in V_C$, and each element $M_{wc}$ measures the association between a word and a context.
Word embedding: factorize $M$ into a $|V_W| \times r$ word embedding matrix $W$ and a $|V_C| \times r$ context embedding matrix $C$.
CBOW/skip-gram: $M_{wc} = \log\left( \frac{\#(w,c)\,|D|}{\#(w)\,\#(c)} \right) - \log k$, called the shifted pointwise mutual information.
The stochastic-gradient-based training method is similar to a symmetric SVD: $W^{\mathrm{SVD}_{1/2}} = U_r \Sigma_r^{1/2}$, where $M = U \Sigma V^T$.
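A compact numpy sketch of this factorization route: build the shifted PMI matrix from toy co-occurrence counts and take a truncated SVD, using the $U_r \Sigma_r^{1/2}$ variant named above. The counts are placeholders.

```python
import numpy as np

def spmi_matrix(counts, k):
    # counts[w, c] = #(w, c); shifted PMI: log( #(w,c) |D| / (#(w) #(c)) ) - log k
    D = counts.sum()
    nw = counts.sum(axis=1, keepdims=True)   # #(w)
    nc = counts.sum(axis=0, keepdims=True)   # #(c)
    return np.log(counts * D / (nw * nc)) - np.log(k)

counts = np.array([[10.0, 2.0, 1.0],
                   [3.0, 8.0, 2.0],
                   [1.0, 1.0, 9.0]])          # toy word-context co-occurrence counts
M = spmi_matrix(counts, k=1)
U, S, Vt = np.linalg.svd(M)                   # M = U Sigma V^T
r = 2
W = U[:, :r] * np.sqrt(S[:r])                 # W^{SVD_1/2} = U_r Sigma_r^{1/2}
print(W)
```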

Cosine similarity
Word similarities: $\text{similarity}(w, w') = \frac{v_w \cdot v_{w'}}{\|v_w\|\,\|v_{w'}\|}$.
Example (letters to shareholders from N-CSR files); the ten nearest neighbours of four query words:

   china       oil         politics       shareholder
1  chinese     commodity   terrorism      shareholders
2  indonesia   energy      rhetoric       stockholders
3  brazil      gasoline    political      stockholder
4  russia      cotton      standoff       shareowner
5  japan       fuel        presidential   trustees
6  asia        gold        partisan       shareowners
7  turkey      brent       debate         classify
8  states      natural     threats        directors
9  population  food        uncertainties  mergers
10 india       ore         attacks        semiannual
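The similarity computation itself is short; the sketch below also ranks nearest neighbours, using random vectors and a tiny made-up vocabulary just to show the mechanics (the rankings in the table above come from the trained embeddings, not from this code).

```python
import numpy as np

def cosine_similarity(v, w):
    # similarity(w, w') = (v_w . v_w') / (||v_w|| ||v_w'||)
    return (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))

def most_similar(query, vocab, vectors, top_n=10):
    # rank every other word in the vocabulary by cosine similarity to `query`
    q = vectors[vocab.index(query)]
    sims = [(w, cosine_similarity(q, vectors[i]))
            for i, w in enumerate(vocab) if w != query]
    return sorted(sims, key=lambda t: t[1], reverse=True)[:top_n]

rng = np.random.default_rng(0)
vocab = ["china", "chinese", "oil", "energy", "shareholder"]
vectors = rng.normal(size=(len(vocab), 8))
print(most_similar("china", vocab, vectors, top_n=3))
```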

Word Clouds by Sentiment. Reduce the 300-dimensional word vectors to 2-dimensional vectors using t-distributed stochastic neighbor embedding (t-SNE). Top 30 words similar to "good" and "bad".
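One way to reproduce this reduction is scikit-learn's TSNE; the sketch below assumes scikit-learn is available and uses random 300-dimensional vectors in place of the trained word vectors.

```python
import numpy as np
from sklearn.manifold import TSNE

# 30 words represented by 300-dimensional vectors (random placeholders here)
rng = np.random.default_rng(0)
vectors_300d = rng.normal(size=(30, 300))

# t-SNE projection to 2 dimensions; perplexity must stay below the number of points
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
vectors_2d = tsne.fit_transform(vectors_300d)
print(vectors_2d.shape)   # (30, 2), ready to plot as a word cloud
```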

Word Clouds by Topics Top 30 words similar to region, politics, investment, macro, index, commodity, shareholder, industry.

Downstream predictions
For downstream predictions, we may need document-level feature vectors:
$y = \beta_0 + \beta_x X + \beta_d v_d$,
where $v_d$ is the document embedding vector. Ways to generate document-level features:
- Direct learning: a document vector is learned directly from the neural network.
- Taking the average: a document vector is the average of the word vectors. Denote the word vectors as $v_w \in \mathbb{R}^d$, where $d$ is the dimensionality of the word embedding space; then a document vector is $v_d = \frac{1}{\#(w \in d)} \sum_{w \in d} v_w$.
- Clustering (k-means, spectral clustering, etc.): cluster words using their embedding vectors, and represent documents using a bag of clusters.
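A sketch of the "taking the average" option, the simplest of the three; the word vectors here are random placeholders.

```python
import numpy as np

def doc_vector_by_average(doc_tokens, word_vecs):
    # v_d = (1 / #(w in d)) * sum of the vectors of the words appearing in the document
    vecs = [word_vecs[w] for w in doc_tokens if w in word_vecs]
    return np.mean(vecs, axis=0)

rng = np.random.default_rng(0)
word_vecs = {w: rng.normal(size=8) for w in ["stock", "returns", "oil", "energy"]}
doc = ["stock", "returns", "oil"]
v_d = doc_vector_by_average(doc, word_vecs)
print(v_d.shape)   # one 8-dimensional feature vector per document
```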