The representation of words and sentences

Jul 4, 2017

Presentation Outline: 1. Word representations before word2vec; 2. word2vec; 3. GloVe; 4. Evaluating word embeddings; 5. Practical tips and remaining problems; 6. Sentence and document embeddings

Discrete representation: use a taxonomy such as WordNet. Example: the WordNet entry for "good".

Problems: it treats words such as adept, expert, good as plain synonyms and misses nuance; it can't keep up to date with new words and meanings; and it can't compute accurate word similarity.

Vector representation: a one-hot vector [0, 0, 0, ..., 1, 0, ..., 0, 0] takes too much space, the vectors are mutually orthogonal, and it is hard to compute similarity between them.

Dense vectors: represent a word by its neighbors. Example: "The cat is running in a room" vs. "A dog is walking in a bedroom".

Co-occurrence matrix: count how often each pair of words occurs together within a context window.
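
To make this concrete, here is a minimal Python sketch (not from the slides) that builds a word-word co-occurrence matrix with a window of +/-2 from the two example sentences above:

    # Build a toy word-word co-occurrence matrix (window of +/-2).
    from collections import Counter

    sentences = [
        "the cat is running in a room".split(),
        "a dog is walking in a bedroom".split(),
    ]
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}

    window = 2
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[(index[w], index[sent[j]])] += 1

    # Dense |V| x |V| matrix; real corpora need a sparse representation.
    matrix = [[counts[(r, c)] for c in range(len(vocab))] for r in range(len(vocab))]
    print(vocab)
    print(matrix[index["cat"]])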

Problems: the matrix is extremely sparse, hard to update, and very high-dimensional.

A Neural Probabilistic Language Model (Bengio et al., 2003).

Part 2: word2vec

Most common method: word2vec. CBOW: take one-hot vectors for the words around the center word, $x^{(c-m)}, x^{(c-m+1)}, \ldots, x^{(c-1)}, x^{(c+1)}, \ldots, x^{(c+m)}$; embed them, $v_i = V x^{(i)}$ for $i = c-m, \ldots, c+m$ ($i \neq c$); average, $\hat{v} = \operatorname{mean}(v)$; score every vocabulary word, $z = U \hat{v}$, $\hat{y} = \operatorname{softmax}(z)$; and minimize $J(\theta) = -\log P(u_c \mid \hat{v}) = -u_c^T \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^T \hat{v})$.
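
A minimal numpy sketch of these CBOW equations, with made-up toy sizes and indices (vocabulary size 10, dimension 4, window m = 2); it only checks that the loss above equals the negative log-softmax probability of the center word:

    import numpy as np

    V_size, d, m = 10, 4, 2                    # toy vocabulary size, dimension, window
    rng = np.random.default_rng(0)
    V = rng.normal(size=(d, V_size))           # input (context) embedding matrix
    U = rng.normal(size=(V_size, d))           # output (center) embedding matrix

    context_ids = [1, 2, 4, 5]                 # the 2m context words around the center
    center_id = 3

    v_hat = V[:, context_ids].mean(axis=1)     # v_i = V x^(i), averaged into v_hat
    z = U @ v_hat                              # one score per vocabulary word
    y_hat = np.exp(z) / np.exp(z).sum()        # softmax
    loss = -z[center_id] + np.log(np.exp(z).sum())  # -u_c^T v_hat + log sum_j exp(u_j^T v_hat)
    print(loss, -np.log(y_hat[center_id]))     # the two expressions agree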

Skip-gram: take the one-hot vector of the center word $x^{(c)}$; embed it, $v_c = V x^{(c)}$; score every vocabulary word, $z = U v_c$, $\hat{y} = \operatorname{softmax}(z)$; and minimize $J(\theta) = -\sum_{j=0,\, j \neq m}^{2m} u_{c-m+j}^T v_c + 2m \log \sum_{k=1}^{|V|} \exp(u_k^T v_c)$.
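
The corresponding skip-gram loss in the same made-up toy setting (a sketch, not the presenter's code):

    import numpy as np

    V_size, d, m = 10, 4, 2
    rng = np.random.default_rng(0)
    V = rng.normal(size=(d, V_size))           # input (center) embeddings
    U = rng.normal(size=(V_size, d))           # output (context) embeddings

    center_id = 3
    context_ids = [1, 2, 4, 5]                 # the 2m words around the center

    v_c = V[:, center_id]
    z = U @ v_c
    log_Z = np.log(np.exp(z).sum())
    # J = -sum_j u_{c-m+j}^T v_c + 2m * log sum_k exp(u_k^T v_c)
    loss = -sum(z[j] for j in context_ids) + 2 * m * log_Z
    print(loss)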

Part 3: GloVe

GloVe: word2vec can capture complex linguistic patterns, but it does not use global co-occurrence statistics; GloVe combines the co-occurrence matrix with word2vec. From skip-gram, $Q_{ij} = \frac{\exp(w_i^T \tilde{w}_j)}{\sum_{k=1}^{|V|} \exp(w_i^T \tilde{w}_k)}$ and $J = -\sum_{i \in \text{corpus}} \sum_{j \in \text{context}(i)} \log Q_{ij}$, which is hard to compute because of the normalization over the whole vocabulary.

Grouping repeated word pairs gives $J = -\sum_{i=1}^{|V|} \sum_{j=1}^{|V|} X_{ij} \log Q_{ij}$, where $X_{ij}$ comes from the co-occurrence matrix $X$; with $X_i = \sum_k X_{ik}$ this is $J = \sum_{i=1}^{|V|} X_i H(P_i, Q_i)$. Replace the cross entropy with least squares: $\hat{J} = \sum_{i,j} X_i (\hat{P}_{ij} - \hat{Q}_{ij})^2$, where $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(w_i^T \tilde{w}_j)$. Since $X_{ij}$ may be large, take logarithms: $\hat{J} = \sum_{i,j} X_i (\log \hat{P}_{ij} - \log \hat{Q}_{ij})^2 = \sum_{i,j} X_i (w_i^T \tilde{w}_j - \log X_{ij})^2$. Finally, replace $X_i$ with a weighting function: $J = \sum_{i,j} f(X_{ij}) (w_i^T \tilde{w}_j - \log X_{ij})^2$.
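
A minimal numpy sketch of the final weighted least-squares objective; the weighting function f(x) = (x / x_max)^alpha, capped at 1, is the one from the GloVe paper, while the toy counts and the omission of the paper's bias terms are choices made here to stay close to the slide:

    import numpy as np

    def weight(x, x_max=100.0, alpha=0.75):
        # f(X_ij): down-weights rare pairs and caps very frequent ones
        return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

    def glove_loss(W, W_tilde, X):
        # J = sum_ij f(X_ij) * (w_i^T w~_j - log X_ij)^2 over nonzero X_ij
        i, j = np.nonzero(X)
        dots = np.einsum("nd,nd->n", W[i], W_tilde[j])
        return np.sum(weight(X[i, j]) * (dots - np.log(X[i, j])) ** 2)

    rng = np.random.default_rng(0)
    vocab_size, dim = 6, 3
    X = rng.integers(0, 5, size=(vocab_size, vocab_size)).astype(float)  # toy counts
    W = rng.normal(size=(vocab_size, dim))
    W_tilde = rng.normal(size=(vocab_size, dim))
    print(glove_loss(W, W_tilde, X))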

Part 4: Evaluating word embeddings

How do we evaluate a word embedding? Evaluation Methods for Unsupervised Word Embeddings (Tobias Schnabel et al., 2015). Intrinsic: evaluate on a specific intermediate task such as word analogies, e.g. king - queen = man - woman and bad - worst = good - best; fast, but it is unclear how well it predicts performance on real tasks. Extrinsic: use the vectors as inputs to a full machine learning system and measure performance on your own task; slow, but directly useful.
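
A sketch of the intrinsic analogy test with a few made-up 3-dimensional vectors (a real evaluation would use trained embeddings and a large analogy set):

    import numpy as np

    def analogy(emb, a, b, c, topn=1):
        """Return the words closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
        target = emb[b] - emb[a] + emb[c]
        target = target / np.linalg.norm(target)
        scores = {
            w: float(v @ target / np.linalg.norm(v))
            for w, v in emb.items() if w not in (a, b, c)
        }
        return sorted(scores, key=scores.get, reverse=True)[:topn]

    # Toy vectors, chosen so that king - man + woman lands near queen.
    emb = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.1, 0.8]),
        "man":   np.array([0.1, 0.9, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9]),
        "paris": np.array([0.5, 0.5, 0.5]),
    }
    print(analogy(emb, "man", "king", "woman"))  # expected: ['queen']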

Part 5: Practical tips and remaining problems

Which settings work well? Performance is heavily dependent on the model used for the embedding. Performance increases with larger corpus sizes. Performance is lower for extremely low-dimensional as well as extremely high-dimensional vectors, although within a reasonable range larger dimensions tend to perform better. Corpus domain is more important than corpus size. Use skip-gram for small corpora (< 500M tokens) and CBOW for large corpora. Train for roughly 30-50 iterations with at least 50 dimensions.
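
One hedged way to act on these tips is gensim's Word2Vec (argument names assume gensim 4.x; the tiny corpus below is only a placeholder for your own tokenized sentences):

    from gensim.models import Word2Vec

    # Placeholder corpus: an iterable of tokenized sentences.
    corpus = [
        ["the", "cat", "is", "running", "in", "a", "room"],
        ["a", "dog", "is", "walking", "in", "a", "bedroom"],
    ]

    # Small corpus, so prefer skip-gram (sg=1), >= 50 dimensions, 30-50 epochs.
    model = Word2Vec(
        sentences=corpus,
        vector_size=50,
        window=5,
        sg=1,          # 1 = skip-gram, 0 = CBOW
        min_count=1,
        epochs=40,
    )
    print(model.wv.most_similar("cat", topn=3))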

Ambiguity: a word may have several meanings, e.g. "tie". Linear Algebraic Structure of Word Senses, with Applications to Polysemy (Sanjeev Arora et al.): $v_{\text{tie}} = \alpha_1 v_{\text{tie}_1} + \alpha_2 v_{\text{tie}_2} + \alpha_3 v_{\text{tie}_3} + \ldots$, where $\alpha_i$ is related to the frequency of sense $\text{tie}_i$. Given the word vectors (a vocabulary of about 60,000 words) and an upper bound $m$, find a set of context vectors $A_1, A_2, \ldots$ such that $v_w = \sum_{j=1}^{m} \alpha_{w,j} A_j + \eta_w$ with at most $k$ of the $\alpha_{w,j}$ nonzero. This is just sparse coding (solved with k-SVD): find a set of basis vectors in the embedding space so that each word vector can be represented in terms of that basis.
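
A rough sketch of the sparse-coding step, using scikit-learn's DictionaryLearning as a stand-in for k-SVD; the random "word vectors" and the values of m and k are made up for illustration:

    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    # Toy stand-in for real word vectors: 200 "words" in 25 dimensions.
    rng = np.random.default_rng(0)
    word_vectors = rng.normal(size=(200, 25))

    # Learn m atoms A_j with at most k nonzero coefficients per word,
    # i.e. v_w ~= sum_j alpha_wj A_j with sparse alpha (cf. the slide).
    m, k = 20, 3
    learner = DictionaryLearning(
        n_components=m,
        transform_algorithm="omp",
        transform_n_nonzero_coefs=k,
        max_iter=20,
        random_state=0,
    )
    codes = learner.fit_transform(word_vectors)  # alpha: shape (n_words, m)
    atoms = learner.components_                  # A_j: shape (m, 25)
    print((codes != 0).sum(axis=1).max())        # at most k nonzeros per word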

Problems: with one-hot or bag-of-words features, "powerful", "strong" and "Paris" are all equally distant, and such representations lose the ordering of the words and ignore the semantics of the words.

Part 6: Sentence and document embeddings

Doc2vec: Distributed Representations of Sentences and Documents. Distributed Memory Model of Paragraph Vectors (PV-DM): the paragraph token can be thought of as another word.

Distributed Bag of Words version of Paragraph Vectors (PV-DBOW). A combination of the two methods works better.
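
A hedged sketch of training both variants with gensim's Doc2Vec and concatenating them (gensim 4.x API assumed; the two toy documents are placeholders):

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [
        TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc0"]),
        TaggedDocument(words=["a", "dog", "walked", "in", "a", "bedroom"], tags=["doc1"]),
    ]

    pv_dm = Doc2Vec(docs, dm=1, vector_size=50, min_count=1, epochs=40)    # PV-DM
    pv_dbow = Doc2Vec(docs, dm=0, vector_size=50, min_count=1, epochs=40)  # PV-DBOW

    # "A combination of the two methods works better": concatenate the vectors.
    doc0 = np.concatenate([pv_dm.dv["doc0"], pv_dbow.dv["doc0"]])
    print(doc0.shape)  # (100,)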

A Simple but Tough-to-Beat Baseline for Sentence Embeddings (Sanjeev Arora et al., 2017).
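
The baseline in that paper averages word vectors with weights a / (a + p(w)) and then removes the projection onto the first singular vector of the resulting sentence matrix. A minimal numpy sketch with made-up vectors and word probabilities:

    import numpy as np

    def sif_embeddings(sentences, vecs, p, a=1e-3):
        """sentences: lists of tokens; vecs: word -> vector; p: word -> unigram prob."""
        X = np.stack([
            np.mean([a / (a + p[w]) * vecs[w] for w in s], axis=0)
            for s in sentences
        ])
        u = np.linalg.svd(X, full_matrices=False)[2][0]  # first singular vector
        return X - np.outer(X @ u, u)                    # remove the common component

    # Toy data, purely illustrative.
    rng = np.random.default_rng(0)
    words = ["the", "cat", "dog", "runs", "walks"]
    vecs = {w: rng.normal(size=5) for w in words}
    p = {w: 1.0 / len(words) for w in words}
    sents = [["the", "cat", "runs"], ["the", "dog", "walks"]]
    print(sif_embeddings(sents, vecs, p).shape)  # (2, 5)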

Some other methods: Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks (Tree-LSTM), and MV-RNNs (Matrix-Vector Recursive Neural Networks).

That's all. Thanks.