Cross-Modal Retrieval: A Pairwise Classification Approach


1 Cross-Modal Retrieval: A Pairwise Classification Approach Aditya Krishna Menon (National ICT Australia; Australian National University), Didi Surian (National ICT Australia; The University of Sydney) and Sanjay Chawla (The University of Sydney; Qatar Computing Research Institute) SIAM International Conference on Data Mining 2015, April 30th, 2015

2 Overview of talk 1 Overview 2 Cross-modal retrieval: unsupervised and supervised 3 The similarity function perspective 4 Learning without annotations 5 Learning with annotations 6 Experiments 7 Conclusion

3 Background Content is increasingly available in multiple modalities (images, text, videos, ...) e.g. "Loveliest of trees, the cherry now / Is hung with bloom along the bough, / And stands about the woodland ride / Wearing white for Eastertide." and "I WANDER'D lonely as a cloud / That floats on high o'er vales and hills, / When all at once I saw a crowd, / A host, of golden daffodils; / Beside the lake, beneath the trees, / Fluttering and dancing in the breeze."

4 Cross-modal retrieval: informally Given the representation of an entity in one modality, find its best representation in all other modalities

5 Cross-modal retrieval: informally e.g. Given a piece of text, what is the image that best accompanies it? [slides 5-6 show the example poem and image pairs from slide 3]

7 Prior work: CCA Canonical Correlation Analysis (CCA): project both modalities onto a latent subspace where they exhibit maximum correlation [figure: image space R_I and text space R_T projected via maps U_I and U_T onto maximally correlated subspaces]

8 Prior work: SM Semantic Matching (SM): project both modalities into a rich supervised subspace [figure: image space R_I and text space R_T mapped into a common semantic space via per-category image and text classifiers; example text: "McAleer's tenure as part owner of the Red Sox came to a swift end. On July 15, 1913, McAleer became involved in a dispute with the AL president, Ban Johnson"]

9 This paper A pairwise classification perspective on cross-modal retrieval Goal is to score pairs of instances based on affinity Seamlessly study scenarios with and without ground-truth annotation Essentially, reduction to logistic regression on pairs of instances Unification of the CCA and SM approaches Good empirical performance on real-world retrieval tasks

10 Overview of talk 1 Overview 2 Cross-modal retrieval: unsupervised and supervised 3 The similarity function perspective 4 Learning without annotations 5 Learning with annotations 6 Experiments 7 Conclusion

11 (Unsupervised) Cross-modal retrieval: training set Training examples: D = {(a^(i), b^(i))}_{i=1}^n [slide: example (image, text) pairs]

12 (Unsupervised) Cross-modal retrieval: training set Training examples: D = {(a^(i), b^(i))}_{i=1}^n, a set of suitable pairings of entities across two modalities: (a^(i), b^(i)) ∈ A × B, where A ⊆ R^{d_a} and B ⊆ R^{d_b} are the modality representations [slide: example (image, text) pairs]

13 Cross-modal retrieval: goal Goal: Learn retrieval mappings f : A → B and g : B → A

14 Cross-modal retrieval: goal Goal: Learn retrieval mappings f : A → B and g : B → A e.g. Given a piece of text, what is the best accompanying image? f : "They call me 'The Wild Rose' / But my name was Elisa Day / Why they call me it, I do not know / For my name was Elisa Day" ↦ [image]

15 Assumption: ground-truth annotations We assume there is a set of ground-truth annotations for each entity; for our purposes, a label from some set Y = {1, 2, ..., K} [slide: example images and texts tagged with categories such as Arts, Nature, History, Politics, Science] These annotations may be unobserved, but are still used for evaluation

16 (Supervised) Cross-modal retrieval: training set Training examples: D = {((a^(i), b^(i)), ·)}_{i=1}^n

17 (Supervised) Cross-modal retrieval: training set Training examples: D = {((a^(i), b^(i)), y^(i))}_{i=1}^n, where y^(i) ∈ Y are the annotations [slide: example pairs annotated Nature, Arts, History] Goal is as before

18 Overview of talk 1 Overview 2 Cross-modal retrieval: unsupervised and supervised 3 The similarity function perspective 4 Learning without annotations 5 Learning with annotations 6 Experiments 7 Conclusion

19 Latent map approach Popular approach: learn latent maps ψ_A : A → R^k, ψ_B : B → R^k, where k ∈ N_+ is the dimensionality of some latent subspace. Compute retrieval maps f : a ↦ argmin_{b ∈ B} d(ψ_A(a), ψ_B(b)) and g : b ↦ argmin_{a ∈ A} d(ψ_A(a), ψ_B(b)), where d(·, ·) is some suitable distance function

20 Latent map approach Popular approach: learn latent maps ψ_A : A → R^k, ψ_B : B → R^k, where k ∈ N_+ is the dimensionality of some latent subspace. Compute retrieval maps f : a ↦ argmin_{b ∈ B} d(ψ_A(a), ψ_B(b)) and g : b ↦ argmin_{a ∈ A} d(ψ_A(a), ψ_B(b)), where d(·, ·) is some suitable distance function. e.g. In CCA, we learn linear projections ψ_A : a ↦ Ua and ψ_B : b ↦ Vb, and use Euclidean distance for retrieval
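
As a concrete illustration of this recipe, here is a minimal sketch using scikit-learn's CCA on toy data; the array names and sizes are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch of the latent-map approach via CCA (toy data).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)
A = rng.randn(100, 50)   # image features, one row per entity
B = rng.randn(100, 30)   # text features, row-aligned with A

cca = CCA(n_components=10).fit(A, B)
psi_A, psi_B = cca.transform(A, B)   # latent maps psi_A(a) = Ua, psi_B(b) = Vb

def retrieve_text(a_latent):
    """f: return the index of the text minimising Euclidean distance."""
    return int(np.argmin(np.linalg.norm(psi_B - a_latent, axis=1)))

print(retrieve_text(psi_A[0]))
```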

21 Similarity function approach Our approach: learn a similarity function s : A × B → R that assigns a high score when the given pair are related to each other. Compute retrieval maps f : a ↦ argmax_{b ∈ B} s(a, b) and g : b ↦ argmax_{a ∈ A} s(a, b). Captures the latent map approach as a special case

22 Similarity function approach Our approach: learn a similarity function s : A × B → R that assigns a high score when the given pair are related to each other. Compute retrieval maps f : a ↦ argmax_{b ∈ B} s(a, b) and g : b ↦ argmax_{a ∈ A} s(a, b). Captures the latent map approach as a special case But what is a good similarity?
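
In code, any scorer s induces both retrieval maps directly; a small sketch (make_retrieval_maps is a hypothetical helper, not from the paper):

```python
# Any similarity function induces the two retrieval maps via argmax (sketch).
def make_retrieval_maps(s, A_items, B_items):
    """Return f: a -> index of the best b, and g: b -> index of the best a."""
    f = lambda a: max(range(len(B_items)), key=lambda j: s(a, B_items[j]))
    g = lambda b: max(range(len(A_items)), key=lambda i: s(A_items[i], b))
    return f, g

# The latent-map approach is recovered with s(a, b) = -d(psi_A(a), psi_B(b)).
```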

23 What is a good similarity? s(a, b) should be high iff a, b have the same ground-truth annotation

24 What is a good similarity? s(a, b) should be high iff a, b have the same ground-truth annotation. A: random variable over A (1st modality); B: random variable over B (2nd modality); Y: random variable over Y (annotation for A); Y′: random variable over Y (annotation for B)

25 What is a good similarity? s(a, b) should be high iff a, b have the same ground-truth annotation. A: random variable over A (1st modality); B: random variable over B (2nd modality); Y: random variable over Y (annotation for A); Y′: random variable over Y (annotation for B). Here, Z := 2·1[Y = Y′] − 1 ∈ {±1} indicates whether or not the two annotations agree

26 What is a good similarity? s(a, b) should be high iff a, b have the same ground-truth annotation. A: random variable over A (1st modality); B: random variable over B (2nd modality); Y: random variable over Y (annotation for A); Y′: random variable over Y (annotation for B). Here, Z := 2·1[Y = Y′] − 1 ∈ {±1} indicates whether or not the two annotations agree. Judge s according to the probability of incorrectly assessing compatible pairs: L(s) = P_{A,B,Z}(s(A, B) · Z < 0), i.e. the sign of s should indicate whether the ground truths align

27 Reduction to pairwise classification We can view Z as a binary label on the instance pair (A, B)

28 Reduction to pairwise classification We can view Z as a binary label on the instance pair (A, B). Learning a similarity function ≡ binary classification on the paired instance space A × B

29 Reduction to pairwise classification We can view Z as a binary label on the instance pair (A, B). Learning a similarity function ≡ binary classification on the paired instance space A × B. Ideally, treat D = {((a^(i), b^(i)), z^(i))}_{i=1}^n as input to a standard binary classifier

30 Reduction to pairwise classification We can view Z as a binary label on the instance pair (A, B). Learning a similarity function ≡ binary classification on the paired instance space A × B. Ideally, treat D = {((a^(i), b^(i)), z^(i))}_{i=1}^n as input to a standard binary classifier. More effort is needed depending on whether or not we have ground-truth annotations

31 Overview of talk 1 Overview 2 Cross-modal retrieval: unsupervised and supervised 3 The similarity function perspective 4 Learning without annotations 5 Learning with annotations 6 Experiments 7 Conclusion

32 Learning without annotations Without ground-truth annotations, we only observe pairs that are linked; we do not observe pairs that do not link. Training set: D = {((a^(i), b^(i)), z^(i))} where all labels z^(i) = +1 [slide: the observed example pairs, all labelled +]

33 Learning without annotations Without ground-truth annotations, we only observe pairs that are linked; we do not observe pairs that do not link. Training set: D = {((a^(i), b^(i)), z^(i))} where all labels z^(i) = +1. Learning from positive-only data: cannot apply a standard binary classifier! [slide: the observed example pairs, all labelled +]

34 Reduction to positive and unlabelled learning Basic idea: construct all pairs of instances from each modality, D = {((a^(i), b^(j)), z^(i,j))}_{i,j=1}^n. We know z^(i,i) = +1, but z^(i,j) is unknown for i ≠ j [slide: matched pairs labelled +, all other cross-pairs labelled ?]

35 Reduction to positive and unlabelled learning Basic idea: construct all pairs of instances from each modality, D = {((a^(i), b^(j)), z^(i,j))}_{i,j=1}^n. We know z^(i,i) = +1, but z^(i,j) is unknown for i ≠ j. This is learning from positive and unlabelled data; it aims to better discriminate linked and unlinked pairs [slide: matched pairs labelled +, all other cross-pairs labelled ?]

36 Reduction to binary classification Simple recipe for positive and unlabelled learning [Elkan and Noto, 2008]: Treat unlabelled instances as negatives Learn a class-probability estimator (e.g. logistic regression) Optimal for ranking purposes!

37 Reduction to binary classification Simple recipe for positive and unlabelled learning [Elkan and Noto, 2008]: Treat unlabelled instances as negatives Learn a class-probability estimator (e.g. logistic regression) Optimal for ranking purposes! Thus, we can perform logistic regression on the labelled instances D = {((a^(i), b^(j)), 2·1[i = j] − 1)}_{i,j=1}^n [slide: matched pairs labelled +, mismatched pairs labelled −]
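
A sketch of this labelling step on toy arrays, treating every unmatched cross-pair as a negative per the Elkan-Noto recipe (build_pu_pairs is a hypothetical helper name):

```python
# Build the PU-style training set: the n matched pairs get z = +1, and the
# n^2 - n unmatched cross-pairs are treated as negatives (sketch).
import numpy as np

def build_pu_pairs(A, B):
    """All cross-modal pairs (a_i, b_j) with z_{ij} = 2*1[i == j] - 1."""
    n = A.shape[0]
    pairs = [(A[i], B[j]) for i in range(n) for j in range(n)]
    z = np.array([2 * int(i == j) - 1 for i in range(n) for j in range(n)])
    return pairs, z
```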

38 Choice of feature mapping We consider l2-regularised logistic regression with some linear scorer: min_w (1/n²) Σ_{i=1}^n Σ_{j=1}^n log(1 + e^{−z^(i,j) · s(a^(i), b^(j); w)}) + (λ/2) ‖w‖²_2, where s(a, b; w) = ⟨w, Φ(a, b)⟩ for some feature mapping Φ. Simplest Φ is concatenation: Φ(a, b) = [a; b]

39 Choice of feature mapping We consider l2-regularised logistic regression with some linear scorer: min_w (1/n²) Σ_{i=1}^n Σ_{j=1}^n log(1 + e^{−z^(i,j) · s(a^(i), b^(j); w)}) + (λ/2) ‖w‖²_2, where s(a, b; w) = ⟨w, Φ(a, b)⟩ for some feature mapping Φ. Simplest Φ is concatenation: Φ(a, b) = [a; b] Problem: induces the same ranking over b ∈ B for any a ∈ A!
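
A tiny numerical check of this failure (toy weights and items, chosen arbitrarily): with Φ(a, b) = [a; b] the score splits as w_a·a + w_b·b, so the first term is a constant offset for a fixed query and the induced ranking over b never changes.

```python
# Concatenation features: score(a, b) = w_a . a + w_b . b, so the ranking
# over b is identical for every query a (the w_a . a term is constant).
import numpy as np

rng = np.random.RandomState(0)
w_a, w_b = rng.randn(5), rng.randn(4)
B_items = rng.randn(10, 4)

def score(a, b):
    return w_a @ a + w_b @ b

for a in rng.randn(3, 5):                                # three different queries
    print(np.argsort([-score(a, b) for b in B_items]))   # same order each time
```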

40 Choice of feature mapping Better Φ: Φ(a, b) = [a_d · b_{d′}]_{d,d′}, i.e. consider the cross-product of features. The similarity function is equivalently s(a, b; w) = ⟨w, Φ(a, b)⟩ = aᵀWb; c.f. bilinear regression. CCA ≈ low-rank approximation to W

41 Choice of feature mapping Better Φ: Φ(a, b) = [a_d · b_{d′}]_{d,d′}, i.e. consider the cross-product of features. The similarity function is equivalently s(a, b; w) = ⟨w, Φ(a, b)⟩ = aᵀWb; c.f. bilinear regression. CCA ≈ low-rank approximation to W. Learn, for each pair of features across the modalities, their predictive power in determining affinity
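
Putting the pieces together, a sketch of the resulting learner: cross-product features fed to scikit-learn's l2-regularised logistic regression on the all-pairs training set (toy sizes; the paper's own optimiser and regularisation schedule may differ).

```python
# Bilinear pairwise classifier (sketch): Phi(a, b) = vec(a b^T), so that
# <w, Phi(a, b)> = a^T W b once w is reshaped into the matrix W.
import numpy as np
from sklearn.linear_model import LogisticRegression

def phi(a, b):
    return np.outer(a, b).ravel()   # one weight per (feature_a, feature_b) pair

rng = np.random.RandomState(0)
A = rng.randn(30, 5)   # toy image features
B = rng.randn(30, 4)   # toy text features, row-aligned with A
n = A.shape[0]

X = np.stack([phi(A[i], B[j]) for i in range(n) for j in range(n)])
z = np.array([2 * int(i == j) - 1 for i in range(n) for j in range(n)])

clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, z)   # l2 by default
W = clf.coef_.reshape(A.shape[1], B.shape[1])              # bilinear weights
s = lambda a, b: a @ W @ b                                 # retrieval score
```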

42 Overview of talk 1 Overview 2 Cross-modal retrieval: unsupervised and supervised 3 The similarity function perspective 4 Learning without annotations 5 Learning with annotations 6 Experiments 7 Conclusion

43 Learning with annotations With ground-truth annotations, we still only observe pairs that are linked. Training set: D = {((a^(i), b^(i)), y^(i))}_{i=1}^n, where y^(i) ∈ Y is the category for the ith entity [slide: example pairs annotated Nature, Arts, History]

44 Learning with annotations With ground-truth annotations, we still only observe pairs that are linked. Training set: D = {((a^(i), b^(i)), y^(i))}_{i=1}^n, where y^(i) ∈ Y is the category for the ith entity. However, using annotations, we can more accurately infer whether other pairs should be linked or not [slide: example pairs annotated Nature, Arts, History]

45 Pairwise classification approach Basic idea: construct all pairs of instances from each modality, D = {((a^(i), b^(j)), z^(i,j))}_{i,j=1}^n, where z^(i,j) = 2·1[y^(i) = y^(j)] − 1, i.e. link a pair iff the ground-truth annotations coincide. Now learn the logistic regression model as in the case without annotations, only changing the labelling procedure
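
A sketch of the changed labelling step (pairwise_labels is a hypothetical helper name): the n × n label matrix now depends only on whether the annotations agree.

```python
# Pairwise labels from annotations: z_{ij} = 2*1[y_i == y_j] - 1 (sketch).
import numpy as np

def pairwise_labels(y):
    y = np.asarray(y)
    return np.where(y[:, None] == y[None, :], 1, -1)   # n x n label matrix

print(pairwise_labels([0, 1, 0]))
# [[ 1 -1  1]
#  [-1  1 -1]
#  [ 1 -1  1]]
```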

46 An alternate approach There is a simpler approach to learning with annotations. If well-specified, logistic regression will learn a transformation of P(Z = 1 | A = a, B = b)

47 An alternate approach There is a simpler approach to learning with annotations. If well-specified, logistic regression will learn a transformation of P(Z = 1 | A = a, B = b) = P(Y = Y′ | A = a, B = b) = Σ_{k=1}^K P(Y = k | A = a) · P(Y′ = k | B = b) = ⟨η_A(a), η_B(b)⟩, where η_A and η_B are vectors of instance-conditional distributions over labels

48 An alternate approach There is a simpler approach to learning with annotations. If well-specified, logistic regression will learn a transformation of P(Z = 1 | A = a, B = b) = P(Y = Y′ | A = a, B = b) = Σ_{k=1}^K P(Y = k | A = a) · P(Y′ = k | B = b) = ⟨η_A(a), η_B(b)⟩, where η_A and η_B are vectors of instance-conditional distributions over labels. Why not just directly learn η_A, η_B?

49 Marginal approach Basic idea: learn independent models for η_A and η_B, e.g. using multi-class logistic regression on {(a^(i), y^(i))} and {(b^(i), y^(i))}, and use their dot product as the similarity function; c.f. the use of cosine similarity in the SM method. While the individual probability models are linear, the final similarity function will be nonlinear in the inputs
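
A sketch of the marginal approach with scikit-learn (toy data; eta_A and eta_B here simply name the fitted per-modality models):

```python
# Marginal approach (sketch): one multi-class probability model per modality;
# score a pair by the dot product of the predicted class posteriors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
A = rng.randn(60, 5)                       # toy image features
B = rng.randn(60, 4)                       # toy text features
y = rng.randint(0, 3, size=60)             # shared annotations, K = 3

eta_A = LogisticRegression(max_iter=1000).fit(A, y)
eta_B = LogisticRegression(max_iter=1000).fit(B, y)

def s(a, b):
    """<eta_A(a), eta_B(b)>: estimate of P(Y = Y' | a, b)."""
    pa = eta_A.predict_proba(a.reshape(1, -1))[0]
    pb = eta_B.predict_proba(b.reshape(1, -1))[0]
    return float(pa @ pb)
```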

50 Overview of talk 1 Overview 2 Cross-modal retrieval: unsupervised and supervised 3 The similarity function perspective 4 Learning without annotations 5 Learning with annotations 6 Experiments 7 Conclusion

51 Overview of experiments Experiments on three datasets: Wikipedia: 2,173 training, 693 test pairs; 10 semantic categories. Pascal: 700 training, 300 test pairs; 20 semantic categories. TVGraz: 1,558 training, 500 test pairs; 10 semantic categories. Feature representations: Text: LDA topic assignment distributions Images: SIFT features
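
For context, a sketch of how such representations might be computed with off-the-shelf tools (gensim for LDA topics, OpenCV for SIFT); these library choices are assumptions, and the paper's exact feature pipeline may differ.

```python
# Sketch of the feature representations (assumed tooling, not the paper's
# exact pipeline).
from gensim import corpora, models

texts = [["golden", "daffodils", "lake"], ["cherry", "bloom", "bough"]]
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(bows, num_topics=10, id2word=dictionary)

# Text feature: the document's topic-assignment distribution
text_feat = lda.get_document_topics(bows[0], minimum_probability=0.0)

# Image feature: SIFT descriptors (typically quantised into a bag of visual
# words before use); requires opencv-python
# import cv2
# sift = cv2.SIFT_create()
# keypoints, descriptors = sift.detectAndCompute(gray_image, None)
```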

52 Methods compared Baselines: Random predictor OC-SVM CCA SM, SCM Cross-modal metric learning (CMML) Measure performance based on mean average precision (MAP)
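
MAP here is the mean over queries of the average precision of each ranked retrieval list; a small reference sketch (relevance = the retrieved item shares the query's ground-truth category):

```python
# Mean average precision (sketch): AP of each query's ranking, averaged
# over all queries.
import numpy as np

def average_precision(relevant):
    """relevant: binary relevance of the ranked list, in rank order."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
    return float((prec_at_k * relevant).sum() / relevant.sum())

def mean_average_precision(scores, y_query, y_gallery):
    """scores[i, j]: similarity of query i to gallery item j."""
    y_query, y_gallery = np.asarray(y_query), np.asarray(y_gallery)
    aps = []
    for i, row in enumerate(scores):
        order = np.argsort(-row)                    # rank gallery by score
        aps.append(average_precision(y_gallery[order] == y_query[i]))
    return float(np.mean(aps))
```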

53 Results without annotation [bar charts: MAP scores for image and text queries on Wikipedia, Pascal, and TVGraz, comparing Random, OC-SVM, CCA, CMML, and the cross-product method] MAP score improvements over CCA: 2% on Wikipedia, 10% on Pascal, and 3% on TVGraz.

54 Results with annotation [bar charts: MAP scores for image and text queries on Wikipedia, Pascal, and TVGraz, comparing SM, SCM, Pairwise SM, Marginal SM, and Marginal SCM] MAP score improvements over SM: 10% on Wikipedia, 10% on Pascal, and 5% on TVGraz.

55 Precision-recall curves on Wikipedia Consistent dominance of the cross-product method over CCA, and of the marginal method over SM [precision-recall curves for image and text queries on Wikipedia: CCA vs. Cross Product, and SM vs. Marginal SM]

56 Case study: text query Given a text, find images. TEXT QUERY (Warfare): "Cobby was a member of the RAAF Reserve (also known as the Citizen Air Force) during his time with the Civil Aviation Board, and rejoined the Permanent Air Force following the outbreak of World War II" [image results for CCA and the cross-product method, labelled by category: mostly Warfare, but with Geography and Music mismatches] CCA suffers from not considering negative links

57 Case study: image query Given an image (top), find texts. IMAGE QUERY (Geography). Text results for the cross-product method and CCA, labelled by category (mostly Geography, with two Biology mismatches): "Most of the Columbia's drainage basin (which, at, is about the size of France) lies roughly between the ..."; "Until the 20th century, the slough and its side channels and associated ponds and lakes were part of the active ..."; "The Columbia begins its journey in the southern Rocky Mountain Trench in British Columbia (BC) ..."; "There are over of hiking trails at Worlds End State Park. Most of the trails are rocky and steep, so hikers are ..."; "The history of Kaziranga as a protected area can be traced back to 1904, when Mary Victoria Leiter Curzon ..."; "As of the census of 2006, there were 382,872 people, 165,743 households, and 99,114 families residing in ..."; "Various public sector organisations have an important role in the stewardship of the country's fauna ..."; "Only humans are recognized as persons and protected in law by the United Nations Universal Declaration of ..."

58 Overview of talk 1 Overview 2 Cross-modal retrieval: unsupervised and supervised 3 The similarity function perspective 4 Learning without annotations 5 Learning with annotations 6 Experiments 7 Conclusion

59 Conclusion Pairwise classification perspective on cross-modal retrieval Unified perspective in the presence and absence of ground truth annotations Logistic regression approach for learning without and with annotations Without: reduction to positive and unlabelled learning With: learn probability distributions for each modality Good empirical performance

60 Questions?
