word2vec
Jul 4, 2017
Presentation Outline
1. Word representations
2. word2vec
3. GloVe
4. Evaluating word embeddings
5. Practical training and ambiguity
6. Doc2vec and beyond
Discrete representation: a taxonomy such as WordNet. Example: looking up "good" and its synonym sets.
Problems with WordNet: synonyms (adept, expert, good) are treated as equivalent despite nuances; it can't keep up to date with new words; it can't give accurate similarity.
Vector representation. One-hot vector: [0, 0, 0, ..., 1, 0, ..., 0, 0]. It takes too much space, and all vectors are orthogonal, so it is hard to compute similarity.
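A minimal sketch of why one-hot vectors cannot express similarity (the five-word vocabulary here is hypothetical):

```python
import numpy as np

# Toy five-word vocabulary; each word is a one-hot row of the identity matrix.
vocab = ["cat", "dog", "room", "good", "bad"]
one_hot = np.eye(len(vocab))

cat = one_hot[vocab.index("cat")]
dog = one_hot[vocab.index("dog")]
print(np.dot(cat, dog))  # 0.0: every pair of distinct words is orthogonal,
                         # so cosine similarity is always 0 and tells us nothing.
```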
Dense vector: represent a word by its neighbors. Example: "The cat is running in a room" / "A dog is walking in a bedroom": "cat" and "dog" occur in similar contexts, so they should receive similar vectors.
Co-occurrence Matrix
Problems: extremely sparse; high dimension; hard to update.
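To make this concrete, a minimal sketch of building a window-based co-occurrence matrix from the two example sentences above (window size 2 is an arbitrary choice):

```python
import numpy as np

corpus = [["the", "cat", "is", "running", "in", "a", "room"],
          ["a", "dog", "is", "walking", "in", "a", "bedroom"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 2
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

# With a real vocabulary (10^5 to 10^6 words) X is huge, mostly zeros,
# and must be rebuilt when new text arrives: exactly the problems above.
```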
A Neural Probabilistic Language Model (Bengio et al., 2003)
Part 2: word2vec
Most common method: word2vec.
CBOW: one-hot vectors for the words around the center word, $x^{(c-m)}, \ldots, x^{(c-1)}, x^{(c+1)}, \ldots, x^{(c+m)}$.
$v_i = V x^{(i)}$ for $i = c-m, \ldots, c+m$ (excluding $c$); $\hat{v} = \mathrm{mean}(v_i)$; $z = U \hat{v}$; $\hat{y} = \mathrm{softmax}(z)$.
$J(\theta) = -\log P(u_c \mid \hat{v}) = -u_c^{T} \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^{T} \hat{v})$
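A minimal numpy sketch of this forward pass and loss (toy dimensions and word indices are hypothetical; indexing columns of $V$ replaces the one-hot multiplication):

```python
import numpy as np

def cbow_loss(V, U, context_ids, center_id):
    """One CBOW step: average the context word vectors, score every
    vocabulary word, and return the softmax cross-entropy loss.
    V: (d, |vocab|) input embeddings; U: (|vocab|, d) output embeddings."""
    v_hat = V[:, context_ids].mean(axis=1)     # v_hat = mean of context vectors
    z = U @ v_hat                              # z = U v_hat
    log_probs = z - np.log(np.sum(np.exp(z)))  # log softmax over the vocabulary
    return -log_probs[center_id]               # J = -log P(u_c | v_hat)

rng = np.random.default_rng(0)
d, vocab_size = 8, 100                         # toy sizes, chosen arbitrarily
V = rng.normal(size=(d, vocab_size))
U = rng.normal(size=(vocab_size, d))
print(cbow_loss(V, U, context_ids=[3, 7, 12, 25], center_id=9))
```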
Skip-gram: one-hot vector for the center word $x^{(c)}$.
$v_c = V x^{(c)}$; $z = U v_c$; $\hat{y} = \mathrm{softmax}(z)$.
$J(\theta) = -\sum_{j=0,\, j \neq m}^{2m} u_{c-m+j}^{T} v_c + 2m \log \sum_{k=1}^{|V|} \exp(u_k^{T} v_c)$
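The mirror-image sketch for skip-gram, under the same toy assumptions as the CBOW example:

```python
import numpy as np

def skipgram_loss(V, U, center_id, context_ids):
    """Skip-gram step: score every vocabulary word against the center
    vector and sum the cross-entropy loss over the 2m context words."""
    v_c = V[:, center_id]                   # v_c = V x^(c)
    z = U @ v_c                             # scores u_k^T v_c for all k
    log_norm = np.log(np.sum(np.exp(z)))    # log sum_k exp(u_k^T v_c)
    return -np.sum(z[context_ids]) + len(context_ids) * log_norm

rng = np.random.default_rng(0)
V = rng.normal(size=(8, 100))
U = rng.normal(size=(100, 8))
print(skipgram_loss(V, U, center_id=9, context_ids=[3, 7, 12, 25]))
```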
Part 3: GloVe
word2vec can capture complex linguistic patterns, but it can't use global co-occurrence statistics. GloVe combines the co-occurrence matrix and word2vec.
From skip-gram: $Q_{ij} = \dfrac{\exp(w_i^{T} \hat{w}_j)}{\sum_{k=1}^{V} \exp(w_i^{T} \hat{w}_k)}$,
$J = -\sum_{i \in \text{corpus},\, j \in \text{context}(i)} \log Q_{ij}$, which is hard to compute because of the normalization over the whole vocabulary.
Group identical pairs: $J = -\sum_{i=1}^{V} \sum_{j=1}^{V} X_{ij} \log Q_{ij}$, where $X_{ij}$ is from the co-occurrence matrix $X$; equivalently $J = \sum_{i=1}^{V} X_i H(P_i, Q_i)$.
Replace cross entropy with least squares: $J = \sum_{ij} X_i (\hat{P}_{ij} - \hat{Q}_{ij})^2$, with $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(w_i^{T} \hat{w}_j)$ unnormalized.
$X_{ij}$ may be large, so take logarithms: $J = \sum_{ij} X_i (\log \hat{P}_{ij} - \log \hat{Q}_{ij})^2 = \sum_{ij} X_i (w_i^{T} \hat{w}_j - \log X_{ij})^2$.
Final: $J = \sum_{ij} f(X_{ij}) (w_i^{T} \hat{w}_j - \log X_{ij})^2$
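A minimal sketch of evaluating this final objective over a given co-occurrence matrix (the values of x_max and alpha follow the GloVe paper; the paper's bias terms are omitted here for brevity):

```python
import numpy as np

def glove_loss(W, W_tilde, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective over nonzero co-occurrences.
    W, W_tilde: (V, d) word and context embeddings; X: (V, V) counts."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        f = min(1.0, (X[i, j] / x_max) ** alpha)  # weighting function f(X_ij)
        loss += f * (W[i] @ W_tilde[j] - np.log(X[i, j])) ** 2
    return loss

rng = np.random.default_rng(0)
V, d = 6, 4                                       # toy sizes
W, W_t = rng.normal(size=(V, d)), rng.normal(size=(V, d))
X = rng.integers(0, 5, size=(V, V)).astype(float)  # stand-in for real counts
print(glove_loss(W, W_t, X))
```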
Part 4: Evaluating word embeddings
How to evaluate word embeddings ("Evaluation methods for unsupervised word embeddings", Tobias Schnabel et al.).
Intrinsic: evaluate on intermediate tasks such as analogies (king - queen = man - woman; bad - worst = good - best); fast, but unclear whether it transfers.
Extrinsic: use the vectors as inputs for an elaborate machine learning system and measure performance on your task; slow but useful.
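An intrinsic analogy check is one line in gensim; this sketch assumes pretrained vectors are available locally ("vectors.bin" is a placeholder path):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
# king - man + woman should rank "queen" near the top.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```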
Part 5: Practical training and ambiguity
How to train good word vectors?
Performance is heavily dependent on the model used for training.
Performance increases with larger corpus sizes.
Performance is lower for extremely low as well as extremely high dimensional vectors, but within that range larger dimensions lead to better performance.
Corpus domain is more important than corpus size.
For a small corpus (< 500M) use skip-gram; for a big corpus use CBOW.
Use at least 30-50 iterations and at least 50 dimensions.
A training sketch reflecting these rules of thumb follows.
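A minimal sketch with gensim 4.x: skip-gram (sg=1) for a small corpus, 50 dimensions, 30 epochs, matching the recommendations above (the toy corpus is only for illustration; min_count=1 is needed because it is so small):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "is", "running", "in", "a", "room"],
             ["a", "dog", "is", "walking", "in", "a", "bedroom"]]
model = Word2Vec(sentences, vector_size=50, window=5, sg=1,
                 min_count=1, epochs=30)
print(model.wv.most_similar("cat", topn=2))
```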
Ambiguity: a word may have several meanings, like "tie".
"Linear Algebraic Structure of Word Senses, with Applications to Polysemy" (Sanjeev Arora et al.):
$v_{\text{tie}} = \alpha_1 v_{\text{tie}_1} + \alpha_2 v_{\text{tie}_2} + \alpha_3 v_{\text{tie}_3} + \ldots$, where $\alpha_i$ is related to the frequency of sense $\text{tie}_i$.
Given the word vectors (about 60,000 words) and an upper bound $m$, find a set of context vectors $A_1, A_2, \ldots, A_m$ such that $v_w = \sum_{j=1}^{m} \alpha_{w,j} A_j + \eta_w$, with at most $k$ of the $\alpha_{w,j}$ nonzero.
This is just sparse coding (solvable with k-SVD): find a set of basis vectors in the space such that each word can be represented over the basis.
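A sparse-coding sketch with scikit-learn, used here as a stand-in for k-SVD; the word vectors are random placeholder data, and m and k are illustrative values:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(1000, 50))  # hypothetical stand-in embeddings

m, k = 200, 5                               # number of atoms, sparsity bound
learner = DictionaryLearning(n_components=m, transform_algorithm="omp",
                             transform_n_nonzero_coefs=k, max_iter=10)
alphas = learner.fit_transform(word_vectors)  # at most k nonzeros per word
atoms = learner.components_                   # the basis A_1 ... A_m
```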
Problems: in a bag-of-words representation, "powerful", "strong" and "Paris" are equally distant; such representations lose the ordering of the words and ignore the semantics of the words.
Part 6: Doc2vec and beyond
Doc2vec: "Distributed Representations of Sentences and Documents" (Le and Mikolov).
Distributed Memory Model of Paragraph Vectors (PV-DM): the paragraph token can be thought of as another word.
Distributed Bag of Words version of Paragraph Vector (PV-DBOW). A combination of the two methods works better.
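A minimal gensim sketch training both variants and concatenating their paragraph vectors, since the combination works better (the two tiny documents are placeholders):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["the", "cat", "is", "running"], tags=[0]),
        TaggedDocument(words=["a", "dog", "is", "walking"], tags=[1])]
pv_dm = Doc2Vec(docs, dm=1, vector_size=50, min_count=1, epochs=20)    # PV-DM
pv_dbow = Doc2Vec(docs, dm=0, vector_size=50, min_count=1, epochs=20)  # PV-DBOW
combined = list(pv_dm.dv[0]) + list(pv_dbow.dv[0])  # concatenated paragraph vector
```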
"A Simple but Tough-to-Beat Baseline for Sentence Embeddings" (Sanjeev Arora et al., 2017)
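A sketch of that baseline (SIF): weight each word vector by a / (a + p(w)), average per sentence, then remove the projection on the first principal component. It assumes `vecs` (word to vector) and `freq` (word to unigram probability) dictionaries; a = 1e-3 is a typical value from the paper.

```python
import numpy as np

def sif_embeddings(sentences, vecs, freq, a=1e-3):
    """sentences: list of tokenized sentences; returns one row per sentence."""
    emb = np.stack([
        np.mean([a / (a + freq[w]) * vecs[w] for w in sent], axis=0)
        for sent in sentences
    ])
    u = np.linalg.svd(emb.T, full_matrices=False)[0][:, 0]  # 1st singular vector
    return emb - np.outer(emb @ u, u)                       # remove its projection
```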
Some other methods: "Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks" (Tree-LSTM); MV-RNN (Matrix-Vector Recursive Neural Networks).
That's all. Thanks.