Vector-based Models of Semantic Composition. Jeff Mitchell and Mirella Lapata, 2008

Vector-based Models of Semantic Composition Jeff Mitchell and Mirella Lapata, 2008

Composition in Distributional Models of Semantics Jeff Mitchell and Mirella Lapata, 2010

Distributional Hypothesis Words occurring within similar contexts are semantically similar

What is a context? Window of words (co-occurrence frequencies) / Document (LDA topic assignments)
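As a rough sketch of the first option (the notes to slide 4 mention a window of 5 words either side of the target; the paper's frequency normalisation is not reproduced here), window-based co-occurrence counts could be gathered like this:

```python
# Hedged sketch of window-based co-occurrence counts; `tokens` is any list of
# corpus tokens, and the normalisation step used in the paper is omitted.
from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, window=5):
    vectors = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[target][tokens[j]] += 1
    return vectors
```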

Semantic Vector Spaces [figure: 2-D space with vectors for tractor, cat, dog]

Semantic Vector Spaces [figure: the same vectors; similarity as the cosine of the angle between them]

Aim of the paper How can we combine distributional vectors?

General Framework f : V × V → V

General Framework

Composition Functions
Additive: p_i = u_i + v_i
Weighted Additive: p_i = α u_i + β v_i
Kintsch: p_i = u_i + v_i + n_i
Dilation: p_i = (λ − 1)(u · v) u_i + (u · u) v_i
Multiplicative: p_i = u_i v_i
Tensor Product: p_ij = u_i v_j
Circular Convolution: p_i = Σ_j u_j v_{i−j}
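As a rough illustration (not the authors' code), the componentwise functions above could be sketched in numpy as follows; the weights α, β and the dilation factor λ are free parameters whose values here are placeholders, and the Kintsch neighbour term n_i, which needs a lexicon, is sketched separately after the slide notes below.

```python
# Minimal numpy sketch of the componentwise composition functions;
# u and v are equal-length 1-D arrays.
import numpy as np

def additive(u, v):
    return u + v                                   # p_i = u_i + v_i

def weighted_additive(u, v, alpha=0.8, beta=0.2):
    return alpha * u + beta * v                    # p_i = alpha u_i + beta v_i

def multiplicative(u, v):
    return u * v                                   # p_i = u_i v_i

def dilation(u, v, lam=2.0):
    # p_i = (lambda - 1)(u . v) u_i + (u . u) v_i
    # i.e. stretch the component of v parallel to u by a factor lambda;
    # the overall scale factor (u . u) does not affect cosine similarity.
    return (lam - 1.0) * np.dot(u, v) * u + np.dot(u, u) * v
```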

Tensor Product: p_ij = u_i v_j

Tensor product u ⊗ v as a matrix (rows indexed by u, columns by v):

        v_1       v_2       v_3
u_1   u_1 v_1   u_1 v_2   u_1 v_3
u_2   u_2 v_1   u_2 v_2   u_2 v_3
u_3   u_3 v_1   u_3 v_2   u_3 v_3
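In numpy terms (an illustrative sketch, not from the paper), this matrix is simply the outer product of the two vectors:

```python
# The tensor product of two vectors is their outer product: P[i, j] = u_i * v_j.
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
P = np.outer(u, v)   # 3x3 matrix with entries u_i * v_j
```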

? p_i = Σ_j u_j v_{i−j}
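For reference, a small sketch (mine, not from the paper) of circular convolution, both directly and via the usual FFT shortcut:

```python
# Circular convolution: p_i = sum_j u_j * v_{(i - j) mod n}.
import numpy as np

def circular_convolution(u, v):
    n = len(u)
    return np.array([sum(u[j] * v[(i - j) % n] for j in range(n))
                     for i in range(n)])

def circular_convolution_fft(u, v):
    # Equivalent via the convolution theorem: elementwise product in Fourier space.
    return np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(v)))
```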

Composition Functions
Additive: p_i = u_i + v_i
Weighted Additive: p_i = α u_i + β v_i
Kintsch: p_i = u_i + v_i + n_i
Dilation: p_i = (λ − 1)(u · v) u_i + (u · u) v_i
Multiplicative: p_i = u_i v_i
Tensor Product: p_ij = u_i v_j
Circular Convolution: p_i = Σ_j u_j v_{i−j}

Evaluation Method. Human similarity judgements: Adjective-Noun (American country / European state)

Evaluation Method. Human similarity judgements: Adjective-Noun, Noun-Noun (state control / town council)

Evaluation Method. Human similarity judgements: Adjective-Noun, Noun-Noun, Verb-Noun (use knowledge / provide system)

Evaluation Method. Human similarity judgements: Adjective-Noun, Noun-Noun, Verb-Noun. Inter-subject agreement: 0.52
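A hedged sketch of the evaluation protocol described in the slide notes: for each test item, compose the two phrases, take the cosine between the composed vectors, and correlate the model scores with the averaged human ratings using Spearman's rank correlation. The data structures here are my own assumptions.

```python
# Evaluation sketch: Spearman correlation between model cosines and human ratings.
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def evaluate(phrase_pairs, human_scores, compose):
    # phrase_pairs: list of ((u1, v1), (u2, v2)) constituent-vector tuples (hypothetical)
    # human_scores: averaged 1-7 similarity ratings, one per item
    model_scores = [cosine(compose(u1, v1), compose(u2, v2))
                    for (u1, v1), (u2, v2) in phrase_pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```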

Results (window)

Model               Adj-N   N-N    V-Obj
Additive            0.36    0.39   0.30
Kintsch             0.32    0.22   0.29
Multiplicative      0.46    0.49   0.37
Tensor Product      0.41    0.36   0.33
Convolution         0.09    0.05   0.10
Weighted Additive   0.44    0.41   0.34
Dilation            0.44    0.41   0.38
Target Unit         0.43    0.34   0.29
Head Only           0.43    0.17   0.24
Human               0.52    0.49   0.55

Results (document)

Model               Adj-N   N-N    V-Obj
Additive            0.37    0.45   0.40
Kintsch             0.30    0.28   0.33
Multiplicative      0.25    0.45   0.34
Tensor Product      0.39    0.43   0.33
Convolution         0.15    0.17   0.12
Weighted Additive   0.38    0.46   0.40
Dilation            0.38    0.45   0.41
Target Unit         -       -      -
Head Only           0.35    0.27   0.17
Human               0.52    0.49   0.55

Conclusion. Promising results from several models: Multiplicative, (Weighted) Additive, Dilation. Additional structure is needed.

Notes to Slides

Slide 1: The paper I was supposed to present.
Slide 2: The surprise paper I presented.
Slide 3: The Distributional Hypothesis in the authors' own words.
Slide 4: In the 2010 paper, they consider two interpretations of "context": either a window of 5 words on either side of the target word, or an entire document. When using windows, they normalise the co-occurrence frequencies according to the words' overall frequencies. When using documents, they use Latent Dirichlet Allocation as a further processing step.
Slide 5: In both cases, you can embed these semantic representations in a vector space, with the dimensions defined by context words or latent topics, respectively. At the end of this presentation, I will return to the question of whether we really want to view these objects as vectors.
Slide 6: We can define similarity as the cosine of the angle between vectors.
Slide 7: The authors assume that such a semantic space is available, and do not explore different ways of constructing one. Their aim is to explore different ways of combining two vectors to produce a third. This is useful if we want to produce semantic representations of larger constituents.
Slides 8-9: The authors also give a more general framework, including background knowledge, but quickly abandon it. All their methods, except the tensor product, can be expressed as a function of this form.
Slide 10: All the functions they consider.
Slide 11: Vector addition. This is symmetric (commutative), but syntax is not. The next three functions introduce an asymmetry.
Slide 12: Weighted addition. We rescale the vectors before adding them.
Slides 13-15: The Kintsch method. First, we find other vectors in the lexicon which are near to the head vector. Of these, we choose the one nearest to the modifier vector. We then add this to the sum of the head and modifier.
Slides 16-17: Dilation. We decompose the head vector into components parallel and perpendicular to the modifier. We then stretch the parallel component.
Slide 18: Componentwise multiplication. This function has no simple geometric interpretation (a point I will return to later). Each component of the result vector is the product of the corresponding components of the input vectors.
Slides 19-20: Tensor product. Unfortunately, I can't draw an interesting tensor product in two dimensions, and giving a good intuition for tensor products takes longer than a couple of minutes. However, we can express a tensor product of two vectors as a matrix, as shown in the second slide.
Slide 21: Circular convolution. I don't think this makes sense at all, since the result depends on the order of the basis vectors. For instance, if we counted dogs first, then cats, we'd get a different answer than the other way round.
Slide 22: All the functions shown one more time.
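To make the Kintsch step (slides 13-15) concrete, here is a rough sketch under my own assumptions: the lexicon is a hypothetical dict of word vectors and the neighbourhood size m is a free parameter; the original method may differ in detail.

```python
# Sketch of the Kintsch method as described in the notes above.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def kintsch(head, modifier, lexicon, m=20):
    # lexicon: dict mapping words to vectors (hypothetical)
    # take the m lexicon vectors nearest to the head...
    neighbours = sorted(lexicon.values(),
                        key=lambda w: cosine(w, head), reverse=True)[:m]
    # ...choose the one nearest to the modifier...
    n = max(neighbours, key=lambda w: cosine(w, modifier))
    # ...and add it to the sum of head and modifier: p = u + v + n
    return head + modifier + n
```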

Slides 23-25: The authors compare their models against human similarity judgements, given for three types of composition. In each case, human subjects had to judge the similarity of two pairs of words on a seven-point scale. A model's predictions can be compared with the average human score using Spearman's rank correlation coefficient.
Slide 26: Inter-subject agreement (average rank correlation with leave-one-out cross-validation) is fairly low, which shows that this is a difficult task for humans, or that the judgements are not completely reliable.
Slides 27-28: Results for both vector spaces considered. Two baselines are provided: target unit calculates a distributional vector for the phrase, as if it were a single word; head only uses the vector for the head word. The multiplicative method performs very well when using a window of words as the context. In particular, for noun-noun compounds, it performs as well as humans! Verb-object collocations seem to be more difficult for these models to capture; the dilation method performs best there. Circular convolution performs badly, as expected. Using documents as contexts lowers performance for the multiplicative method and for adjective-noun compounds; it raises performance in most other cases. The (weighted) additive model performs well here. We might explain the lower performance of the multiplicative method with the LDA vector space as follows: most words will only appear in a small number of topics, so they have components close to zero. When multiplying two such vectors, almost all components may then be close to zero.
Slides 29-30: Why does the multiplicative method perform so well? Before answering this, we should first ask: what exactly does the function do? One important fact to note is that it must be defined with respect to a distinguished set of basis vectors. If we change the basis, we change the result; in contrast, for addition, dilation, and tensor products, it does not matter what basis we choose. (The choice of basis also doesn't matter for more familiar multiplicative operations in linear algebra, such as the dot product and the cross product.) For this reason, the multiplicative composition function is not a natural operation on vector spaces.
Slides 31-34: But if multiplicative composition is not a natural operation, why does it work so well? Here, it is helpful to go back and look at how we constructed the space to begin with. The dimensions correspond to co-occurrence frequencies or LDA topics. In both cases, all components are positive; in the 2-dimensional case, all vectors are in the top-right quadrant. Furthermore, if we compare vectors using cosine similarity, vectors in the same direction are indistinguishable, so we can normalise all vectors to have length one without affecting similarity. Now, all we have left of the original 2-dimensional vector space is the arc in the top-right quadrant. Although these points are embedded in a vector space, they do not form a vector space on their own.
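The basis-dependence claim is easy to check numerically; here is a small sketch (my own, not from the paper), using an arbitrary 2-D rotation as the change of basis:

```python
# Componentwise multiplication does not commute with a change of basis,
# whereas addition does.
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

theta = 0.3                      # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(R @ (u + v), R @ u + R @ v))      # True: addition is basis-independent
print(np.allclose(R @ (u * v), (R @ u) * (R @ v)))  # False: componentwise product is not
```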

How does the multiplicative method act on this space? It pushes points to the edge (in our example, to the two end points), and it is precisely this behaviour that corresponds to our linguistic intuition that the semantic representations are becoming more specific. So, even though the multiplicative composition function is not a natural operation on vectors, this doesn't matter, because we're not really using vectors anyway. The method works well because it exploits non-vector structure in the data. I think there would be much for linguists to gain in being explicit about this fact, and in choosing more carefully a structure that really is suitable for semantic representations, rather than simply applying a theory because the calculations are possible.
Slide 35: In conclusion, a number of the models show promise in modelling semantic composition. However, non-vector structure is needed: we have already seen that the multiplicative method exploits some additional structure, but much more is needed for a full theory of compositional semantics. The accompanying paper in this session (Erk and Padó, 2008) goes further, both in their selfpref-pow function (which is not a natural vector operation, for the same reason that the multiplicative method is not), and more explicitly in their structured vector space formalism.
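A tiny numerical illustration of this "pushing to the edge" behaviour (my own sketch, with made-up vectors): for positive unit vectors, the componentwise product, once renormalised, leans further towards the dominant axis than either input.

```python
# Componentwise multiplication concentrates mass on the largest components.
import numpy as np

def normalise(x):
    return x / np.linalg.norm(x)

u = normalise(np.array([0.9, 0.4]))
v = normalise(np.array([0.8, 0.6]))
p = normalise(u * v)

print(u[0], v[0], p[0])   # p's first component exceeds both inputs' first components
```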