Word Representations via Gaussian Embedding. Luke Vilnis, Andrew McCallum. University of Massachusetts Amherst

1 Word Representations via Gaussian Embedding. Luke Vilnis, Andrew McCallum. University of Massachusetts Amherst

2 Vector word embeddings
Example word clusters: teacher, chef, astronaut, composer, person; road, street, lane, boulevard
Applications:
- Low-level NLP [Turian et al. 2010, Collobert et al. 2011]
- Named entity extraction [Passos et al. 2014]
- Machine translation [Kalchbrenner & Blunsom 2013, Cho et al. 2014]
- Question answering [Weston et al. 2015]

3 Vector word embeddings
What's missing?
- Breadth
- Asymmetry
Example words: composer, person

4 Gaussian word embeddings
Advantages:
- Breadth
- Asymmetry
Figure labels: famous person Bach composer

5 Gaussian word embeddings
Advantages:
- Breadth
- Asymmetry
For each word i: $v_i$

6 Gaussian word embeddings
Advantages:
- Breadth
- Asymmetry
For each word i: $\mathcal{N}(x; \mu_i, \Sigma_i)$

7 Gaussian word embeddings
Advantages:
- Breadth: covariance matrix
- Asymmetry
For each word i: $\mathcal{N}(x; \mu_i, \Sigma_i)$

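8 Gaussian word embeddings
Advantages:
- Breadth: covariance matrix
- Asymmetry
For each word i:
$\log \mathcal{N}(x; \mu_i, \Sigma_i) \propto -\log\det(\Sigma_i) - (\mu_i - x)^\top \Sigma_i^{-1} (\mu_i - x)$

9 Gaussian word embeddings
Advantages:
- Breadth: covariance matrix
- Asymmetry
For each word i:
$\log \mathcal{N}(x; \mu_i, \Sigma_i) \propto -\log\det(\Sigma_i) - (\mu_i - x)^\top \Sigma_i^{-1} (\mu_i - x)$
- $-\log\det(\Sigma_i)$: logarithmic penalty on volume due to normalization
- $-(\mu_i - x)^\top \Sigma_i^{-1} (\mu_i - x)$: Mahalanobis distance measured by $\Sigma_i$

10 Gaussian word embeddings
Advantages:
- Breadth: covariance matrix
- Asymmetry: KL-divergence
For each word i:
$\log \mathcal{N}(x; \mu_i, \Sigma_i) \propto -\log\det(\Sigma_i) - (\mu_i - x)^\top \Sigma_i^{-1} (\mu_i - x)$
- $-\log\det(\Sigma_i)$: logarithmic penalty on volume due to normalization
- $-(\mu_i - x)^\top \Sigma_i^{-1} (\mu_i - x)$: Mahalanobis distance measured by $\Sigma_i$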
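The two terms of the Gaussian log-density (volume penalty and Mahalanobis distance) can be illustrated with a minimal sketch. It assumes diagonal covariances stored as lists of variances; the function name and the 2-d "narrow" and "broad" embeddings are hypothetical, chosen only to show how breadth changes the score of nearby and distant points.

```python
import math

def diag_gaussian_log_density(x, mean, var):
    """Log of N(x; mean, diag(var)) for equal-length sequences of floats."""
    d = len(mean)
    # log det of a diagonal covariance is the sum of the log-variances
    log_det = sum(math.log(v) for v in var)
    # squared Mahalanobis distance between x and the mean, measured by diag(var)
    mahalanobis = sum((m - xi) ** 2 / v for xi, m, v in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + mahalanobis)

# Hypothetical embeddings (mean, variances): same mean, different breadth.
narrow = ([0.0, 0.0], [0.1, 0.1])
broad = ([0.0, 0.0], [4.0, 4.0])

near = [0.1, 0.1]  # a point close to the shared mean
far = [2.0, 2.0]   # a point far from the shared mean

for name, point in (("near", near), ("far", far)):
    print(name,
          round(diag_gaussian_log_density(point, *narrow), 3),
          round(diag_gaussian_log_density(point, *broad), 3))
```

Under these assumed numbers the narrow Gaussian scores the nearby point higher, while the broad Gaussian scores the distant point higher: the volume penalty favors small covariances, and the Mahalanobis term favors covariances large enough to cover the point.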

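11 Gaussian word embeddings
Advantages:
- Breadth: covariance matrix
- Asymmetry: KL-divergence
$\mathrm{KL}(\mathcal{N}_i \,\|\, \mathcal{N}_j) = \int_x \mathcal{N}(x; \mu_i, \Sigma_i) \log \frac{\mathcal{N}(x; \mu_i, \Sigma_i)}{\mathcal{N}(x; \mu_j, \Sigma_j)} \, dx$

12 Gaussian word embeddings
Advantages:
- Breadth: covariance matrix
- Asymmetry: KL-divergence
$\mathrm{KL}(\mathcal{N}_i \,\|\, \mathcal{N}_j) = \int_x \mathcal{N}(x; \mu_i, \Sigma_i) \log \frac{\mathcal{N}(x; \mu_i, \Sigma_i)}{\mathcal{N}(x; \mu_j, \Sigma_j)} \, dx$
[Figure credit: Wikipedia]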

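13 Gaussian word embeddings
Advantages:
- Breadth: covariance matrix
- Asymmetry: KL-divergence
$\mathrm{KL}(\mathcal{N}_j \,\|\, \mathcal{N}_i) \propto \mathrm{tr}(\Sigma_i^{-1} \Sigma_j) + (\mu_i - \mu_j)^\top \Sigma_i^{-1} (\mu_i - \mu_j) + \log \frac{\det(\Sigma_i)}{\det(\Sigma_j)}$

14 Gaussian word embeddings
Advantages:
- Breadth: covariance matrix
- Asymmetry: KL-divergence
$\mathrm{KL}(\mathcal{N}_j \,\|\, \mathcal{N}_i) \propto \mathrm{tr}(\Sigma_i^{-1} \Sigma_j) + (\mu_i - \mu_j)^\top \Sigma_i^{-1} (\mu_i - \mu_j) + \log \frac{\det(\Sigma_i)}{\det(\Sigma_j)}$
- $\mathrm{tr}(\Sigma_i^{-1} \Sigma_j)$: directions of variance should be aligned, $\Sigma_i$ should be large and $\Sigma_j$ small
- $(\mu_i - \mu_j)^\top \Sigma_i^{-1} (\mu_i - \mu_j)$: distance between means is small as measured by $\Sigma_i$
- $\log \frac{\det(\Sigma_i)}{\det(\Sigma_j)}$: logarithmic penalty on volume due to normalization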
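The asymmetry of the KL divergence can be checked numerically with the standard closed form for two Gaussians. The sketch below assumes diagonal covariances; the function name and the 2-d "composer" and "person" embeddings are hypothetical values, not learned ones.

```python
import math

def kl_diag_gaussians(mean_a, var_a, mean_b, var_b):
    """KL(N_a || N_b) for Gaussians with diagonal covariances.

    Per dimension: 0.5 * (va/vb + (mb - ma)^2 / vb - 1 + log(vb/va)),
    i.e. the trace, Mahalanobis, dimension, and log-determinant terms.
    """
    total = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        total += va / vb + (mb - ma) ** 2 / vb - 1.0 + math.log(vb / va)
    return 0.5 * total

# Hypothetical embeddings (mean, variances): a specific word sits inside a broad one.
composer = ([0.5, 0.0], [0.2, 0.2])
person = ([0.0, 0.0], [4.0, 4.0])

kl_composer_person = kl_diag_gaussians(*composer, *person)
kl_person_composer = kl_diag_gaussians(*person, *composer)
print(round(kl_composer_person, 3), round(kl_person_composer, 3))
```

With these assumed values the divergence from the narrow distribution to the broad one is much smaller than the reverse, which is the property that lets a KL-based score treat "composer is a person" differently from "person is a composer".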

15 Learning vector embeddings
e.g. [Mikolov et al. 2013]
Example text: German musician and composer of the Baroque

16 Learning vector embeddings
e.g. [Mikolov et al. 2013]
Example text: German musician and composer of the Baroque
$E(\mathrm{word}_i, \mathrm{word}_j) = \langle v_i, v_j \rangle$

17 Learning vector embeddings
e.g. [Mikolov et al. 2013]
Example text: German musician and composer of the Baroque
(composer, musician)
$E(\mathrm{word}_i, \mathrm{word}_j) = \langle v_i, v_j \rangle$

18 Learning vector embeddings
e.g. [Mikolov et al. 2013]
Example text: German musician and composer of the Baroque
(composer, musician) vs. (composer, random dictionary word)
$E(\mathrm{word}_i, \mathrm{word}_j) = \langle v_i, v_j \rangle$

19 Learning vector embeddings
e.g. [Mikolov et al. 2013]
Example text: German musician and composer of the Baroque
(composer, musician) vs. (composer, banana)
$E(\mathrm{word}_i, \mathrm{word}_j) = \langle v_i, v_j \rangle$

20 Learning vector embeddings
e.g. [Mikolov et al. 2013]
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$E(\mathrm{word}_i, \mathrm{word}_j) = \langle v_i, v_j \rangle$

21 Learning vector embeddings
e.g. [Mikolov et al. 2013]
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$E(\mathrm{word}_i, \mathrm{word}_j) = \sum_k v_i^{(k)} v_j^{(k)}$

22 Learning vector embeddings
e.g. [Mikolov et al. 2013]
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$E(\mathrm{word}_i, \mathrm{word}_j) = \int_k v_i(k) \, v_j(k) \, dk$

23 Learning vector embeddings
e.g. [Mikolov et al. 2013]
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$E(\mathrm{word}_i, \mathrm{word}_j) = \int_k v_i(k) \, v_j(k) \, dk$

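24 Learning Gaussian embeddings
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$E(\mathrm{word}_i, \mathrm{word}_j) = \int_k v_i(k) \, v_j(k) \, dk$

25 Learning Gaussian embeddings
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$E(\mathrm{word}_i, \mathrm{word}_j) = \int_x \mathcal{N}(x; \mu_i, \Sigma_i) \, \mathcal{N}(x; \mu_j, \Sigma_j) \, dx$ [PPK, Jebara et al. 2003]

26 Learning Gaussian embeddings
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$E(\mathrm{word}_i, \mathrm{word}_j) = \int_x \mathcal{N}(x; \mu_i, \Sigma_i) \, \mathcal{N}(x; \mu_j, \Sigma_j) \, dx = \mathcal{N}(0; \mu_i - \mu_j, \Sigma_i + \Sigma_j)$ [PPK, Jebara et al. 2003]

27 Learning Gaussian embeddings
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$E(\mathrm{word}_i, \mathrm{word}_j) = \int_x \mathcal{N}(x; \mu_i, \Sigma_i) \, \mathcal{N}(x; \mu_j, \Sigma_j) \, dx = \mathcal{N}(0; \mu_i - \mu_j, \Sigma_i + \Sigma_j)$ [PPK, Jebara et al. 2003]
$\log E(\mathrm{word}_i, \mathrm{word}_j) \propto -\log\det(\Sigma_i + \Sigma_j) - (\mu_i - \mu_j)^\top (\Sigma_i + \Sigma_j)^{-1} (\mu_i - \mu_j)$

28 Learning Gaussian embeddings
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$E(\mathrm{word}_i, \mathrm{word}_j) = \int_x \mathcal{N}(x; \mu_i, \Sigma_i) \, \mathcal{N}(x; \mu_j, \Sigma_j) \, dx = \mathcal{N}(0; \mu_i - \mu_j, \Sigma_i + \Sigma_j)$ [PPK, Jebara et al. 2003]
$\log E(\mathrm{word}_i, \mathrm{word}_j) \propto -\log\det(\Sigma_i + \Sigma_j) - (\mu_i - \mu_j)^\top (\Sigma_i + \Sigma_j)^{-1} (\mu_i - \mu_j)$
- $-\log\det(\Sigma_i + \Sigma_j)$: log-volume of ellipse
- $-(\mu_i - \mu_j)^\top (\Sigma_i + \Sigma_j)^{-1} (\mu_i - \mu_j)$: Mahalanobis distance between means

29 Learning Gaussian embeddings
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$E(\mathrm{word}_i, \mathrm{word}_j) = \int_x \mathcal{N}(x; \mu_i, \Sigma_i) \, \mathcal{N}(x; \mu_j, \Sigma_j) \, dx = \mathcal{N}(0; \mu_i - \mu_j, \Sigma_i + \Sigma_j)$ [PPK, Jebara et al. 2003]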
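The closed form of the Gaussian inner product is a zero-mean Gaussian density evaluated at the difference of the means, with the two covariances added. It can be sketched in a few lines. The version below assumes diagonal covariances and works in log space; the function name and the 2-d embeddings for "composer", "musician", and "banana" are hypothetical.

```python
import math

def log_expected_likelihood(mean_i, var_i, mean_j, var_j):
    """log N(0; mean_i - mean_j, diag(var_i + var_j)) for diagonal Gaussians."""
    total = 0.0
    for mi, vi, mj, vj in zip(mean_i, var_i, mean_j, var_j):
        v = vi + vj  # the covariances add under the inner product
        # log-volume term plus squared Mahalanobis distance between the means
        total += math.log(2.0 * math.pi * v) + (mi - mj) ** 2 / v
    return -0.5 * total

# Hypothetical embeddings (mean, variances).
composer = ([1.0, 0.5], [0.3, 0.3])
musician = ([0.8, 0.7], [0.5, 0.5])
banana = ([-3.0, 2.5], [0.4, 0.4])

e_pos = log_expected_likelihood(*composer, *musician)
e_neg = log_expected_likelihood(*composer, *banana)
print(round(e_pos, 3), round(e_neg, 3))
```

Unlike the KL divergence, this energy is symmetric in its two arguments, so it suits similarity judgments rather than directed relations such as entailment.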

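30 Learning Gaussian embeddings
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$\mathrm{Loss}_{\mathrm{PPK}}(w, c_{\mathrm{pos}}, c_{\mathrm{neg}}) = \max(0, \, m - E_{\mathrm{PPK}}(w, c_{\mathrm{pos}}) + E_{\mathrm{PPK}}(w, c_{\mathrm{neg}}))$

31 Learning Gaussian embeddings
Example text: German musician and composer of the Baroque
$E(\text{composer}, \text{musician}) > E(\text{composer}, \text{banana})$
$\mathrm{Loss}_{\mathrm{KL}}(w, c_{\mathrm{pos}}, c_{\mathrm{neg}}) = \max(0, \, m + \mathrm{KL}(c_{\mathrm{pos}} \,\|\, w) - \mathrm{KL}(c_{\mathrm{neg}} \,\|\, w))$ (asymmetric supervision)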
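Both losses are max-margin ranking losses: the PPK version pushes the energy of the positive pair above that of the negative pair by a margin m, and the KL version pushes the divergence of the positive pair below that of the negative pair. A minimal sketch, taking already-computed energies or divergences as plain floats (the function names and the default margin of 1.0 are assumptions for illustration):

```python
def margin_loss_energy(e_pos, e_neg, margin=1.0):
    """max(0, m - E(w, c_pos) + E(w, c_neg)): zero once e_pos beats e_neg by the margin."""
    return max(0.0, margin - e_pos + e_neg)

def margin_loss_kl(kl_pos, kl_neg, margin=1.0):
    """max(0, m + KL(c_pos || w) - KL(c_neg || w)): zero once kl_pos is below kl_neg by the margin."""
    return max(0.0, margin + kl_pos - kl_neg)

# Hypothetical values: the first pair of each kind is already well separated,
# the second violates the margin and therefore contributes a gradient.
print(margin_loss_energy(3.0, 0.5))  # 0.0
print(margin_loss_energy(1.0, 0.5))  # 0.5
print(margin_loss_kl(0.2, 5.0))      # 0.0
print(margin_loss_kl(2.0, 2.5))      # 0.5
```

Only triples that violate the margin produce a nonzero loss, so well-separated pairs stop influencing the parameters.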

32 Related work
- Asymmetric, sparse, distributional [Baroni et al. 2012]
- Dense can be better [Baroni et al. 2014]
- Symmetric, dense [Bengio et al. 2003, Mikolov et al. 2013, many others]
- Bayesian matrix factorization [Salakhutdinov & Mnih 2008]
- (Mixture) density networks [Bishop 1994]
- Gaussian process neural nets [Damianou & Lawrence 2013]

33 Experimental results
- Synthetic hierarchies
- Entailment
- Word similarity tasks
- Scientific key phrase finding

35 Synthetic hierarchy
Train data: child/parent pairs
Objective: $\mathrm{KL}(v_{\mathrm{child}} \,\|\, v_{\mathrm{parent}})$

36 Synthetic hierarchy
[Figure panels: Train data; Learned model]
KL objective accurately learns all containments

37 Experimental results
- Synthetic hierarchies
- Entailment
- Word similarity tasks
- Scientific key phrase finding

39 Entailment
Binary labeled dataset of entailment pairs [Baroni et al. 2012]
(+): adrenaline is-a neurotransmitter; archbishop is-a clergyman; horse is-a mammal; pizza is-a food

40 Entailment
Binary labeled dataset of entailment pairs [Baroni et al. 2012]
(+): adrenaline is-a neurotransmitter; archbishop is-a clergyman; horse is-a mammal; pizza is-a food
(-) no relation: aircrew is-not-a playlist; bamboo is-not-a bear

41 Entailment
Binary labeled dataset of entailment pairs [Baroni et al. 2012]
(+): adrenaline is-a neurotransmitter; archbishop is-a clergyman; horse is-a mammal; pizza is-a food
(-) no relation: aircrew is-not-a playlist; bamboo is-not-a bear
(-) reversed: food is-not-a pizza; molecule is-not-a carbohydrate; gathering is-not-a seminar

42 Entailment
Model: diagonal (D) and spherical (S) variances
Train: ~1b tokens of Wikipedia + 3b tokens of newswire
Evaluate: optimal F1 operating point, average precision
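The two evaluation measures can be sketched for a list of scored pairs with binary labels. This is an illustration under stated assumptions, not the evaluation script behind the reported numbers: higher scores are taken to mean "more likely entailment" (for example a negated KL divergence), scores are assumed distinct, and the example scores and labels are made up.

```python
def best_f1(scores, labels):
    """Maximum F1 over all cut-offs of the score-ranked list (optimal operating point)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    best, true_pos = 0.0, 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos == 0:
            continue  # F1 is zero until the first positive is retrieved
        precision = true_pos / k
        recall = true_pos / total_pos
        best = max(best, 2.0 * precision * recall / (precision + recall))
    return best

def average_precision(scores, labels):
    """Mean of precision@k taken at the rank k of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos, total = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            total += true_pos / k
    return total / sum(labels)

# Hypothetical model scores and gold labels (1 = entailment, 0 = not).
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(round(best_f1(scores, labels), 4), round(average_precision(scores, labels), 4))
```

The optimal-F1 figure picks the threshold after seeing the labels, so it measures how separable the scores are rather than how well a fixed threshold generalizes.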

44 Experimental results
- Synthetic hierarchies
- Entailment
- Word similarity tasks
- Scientific key phrase finding

46 Symmetric word similarity
Word similarity tasks (e.g. WordSim-353):
- (money, bank, 8.5)
- (psychology, Freud, 8.21)
- (media, radio, 7.42)
- (drug, abuse, 6.85)
- (Mars, scientist, 5.63)
- (cup, object, 3.69)
- (professor, cucumber, 0.31)
Evaluate: Spearman's ρ
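Spearman's ρ compares the ranking induced by the model's similarity scores with the ranking of the human ratings. A minimal sketch in plain Python, using the seven human ratings listed on this slide; the model scores are invented for illustration, and the helper names are assumptions.

```python
import math

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def spearman_rho(xs, ys):
    """Spearman's rho is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Human ratings from the slide, in the order listed.
human = [8.5, 8.21, 7.42, 6.85, 5.63, 3.69, 0.31]
# Hypothetical model similarities for the same seven pairs.
model = [0.61, 0.70, 0.55, 0.40, 0.42, 0.30, 0.05]
print(round(spearman_rho(human, model), 4))
```

Because only ranks matter, any monotone rescaling of the model's similarity scores leaves ρ unchanged.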

47 Symmetric word similarity
Models compared: skip-gram; sphere, μ; sphere, μ, Σ; diagonal, μ; diagonal, μ, Σ

48 Experimental results
- Synthetic hierarchies
- Entailment
- Word similarity tasks
- Scientific key phrase finding

50 Scientific key-phrase finding
Example phrases: physics; computational fluid dynamics; Navier-Stokes equations; partial differential equations

51 Scientific key-phrase finding
What makes a good key-phrase?
- High frequency
- Predictive

52 Scientific key-phrase finding
What makes a good key-phrase?
- High frequency
- Predictive
Phrases | Frequent? | Predictive?
conventional wisdom suggests; pre-defined categories | No | No
paper describes; experimental results | Yes | No
EXPTIME complete; autocorrelation function | No | Yes
operational semantics; regular languages | Yes | Yes

53 Scientific key-phrase finding
Sample key-phrases from scientific paper abstracts:
- frequent, predictive: linear matrix inequality; satisfiability problem; encryption schemes; sparse matrix; vector spaces
- rare, uninformative: exploratory study; theoretical basis; major contributions; hot topic

54 Thank you!
Conclusion
Introduced Gaussian word embeddings:
- Capture asymmetry
- Capture broadness of meaning and uncertainty
- Expressive, dense, distributed representation
- Scalable learning: 4 billion tokens, 1 core, 8 hours
Future work:
- Multi-peaked, unnormalized, non-Gaussian
- Relations, documents, semantic frames
- Non-NLP domains for density representations
