Do Neural Network Cross-Modal Mappings Really Bridge Modalities?


Do Neural Network Cross-Modal Mappings Really Bridge Modalities?
Language Intelligence and Information Retrieval group (LIIR), Department of Computer Science

Story
Collell, G., Zhang, T., Moens, M.F. (2017). Imagined Visual Representations as Multimodal Embeddings. AAAI.
Learn a mapping f : text → vision.
Finding 1: Imagined vectors, f(text), outperform the original visual vectors in 7/7 word similarity tasks.
So, why are the mapped vectors multimodal? We conjecture: continuity. The output vector is nothing but the input vector transformed by a continuous map f_θ.
Finding 2 (not in the AAAI paper): Vectors imagined with an untrained network do even better.

Motivation
Applications (e.g., zero-shot image tagging, zero-shot translation or cross-modal retrieval) use linear or neural network maps to bridge modalities / spaces. They then tag / translate based on the neighborhood structure of the mapped vectors f(X).
Research question: Is the neighborhood structure of f(X) similar to that of Y? Or rather to that of X?
How do we measure the similarity of two sets of vectors that live in different spaces? Idea: mean nearest neighbor overlap (mNNO).

General Setting
Mappings f : X → Y to bridge modalities X and Y:
Linear (lin): f(x) = W_0 x + b_0
Feed-forward neural net (nn): f(x) = W_1 σ(W_0 x + b_0) + b_1
[Figure: a point set M and its mapped image f(M).]
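
A minimal sketch of these two mapping families, assuming PyTorch; layer sizes and the choice of Tanh as the non-linearity σ are illustrative assumptions, not the paper's exact configuration:

```python
import torch.nn as nn

class LinearMap(nn.Module):
    """Linear map: f(x) = W_0 x + b_0."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)

    def forward(self, x):
        return self.fc(x)

class FeedForwardMap(nn.Module):
    """One-hidden-layer net: f(x) = W_1 sigma(W_0 x + b_0) + b_1."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.Tanh(),  # stands in for sigma; the actual non-linearity is an assumption
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)
```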

Experiment 1
Definition: the Nearest Neighbor Overlap NNO_K(v_i, z_i) is the number of K nearest neighbors that two paired data points v_i, z_i share in their respective spaces. The mean NNO is:

mNNO_K(V, Z) = (1 / (K·N)) · Σ_{i=1}^{N} NNO_K(v_i, z_i)    (1)

Example: NN_3(v_cat) = {v_dog, v_tiger, v_lion} and NN_3(z_cat) = {z_mouse, z_tiger, z_lion} give NNO_3(v_cat, z_cat) = 2.
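
A small sketch of how mNNO_K could be computed, assuming NumPy and scikit-learn; the distance metric (Euclidean, scikit-learn's default) is an assumption:

```python
from sklearn.neighbors import NearestNeighbors

def mnno(V, Z, K=10):
    """Mean nearest neighbor overlap between two paired sets of vectors.

    V, Z: arrays of shape (N, d_v) and (N, d_z); row i of V is paired with row i of Z.
    Returns mNNO_K(V, Z) = 1/(K*N) * sum_i |NN_K(v_i) ∩ NN_K(z_i)| (compared by index).
    """
    N = V.shape[0]
    # Ask for K+1 neighbors because each point is its own nearest neighbor; drop it.
    idx_v = NearestNeighbors(n_neighbors=K + 1).fit(V).kneighbors(V, return_distance=False)[:, 1:]
    idx_z = NearestNeighbors(n_neighbors=K + 1).fit(Z).kneighbors(Z, return_distance=False)[:, 1:]
    overlap = sum(len(set(idx_v[i]) & set(idx_z[i])) for i in range(N))
    return overlap / (K * N)
```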

Experiment 1
Goal: Learn a map f : X → Y and calculate mNNO(Y, f(X)). Compare it with mNNO(X, f(X)).
Experimental Setup
Datasets: (i) ImageNet; (ii) IAPR TC-12; (iii) Wikipedia.
Visual features: VGG-128 and ResNet.
Text features: ImageNet (GloVe and word2vec); IAPR TC-12 & Wikipedia (biGRU).
Loss: MSE = (1/2) ||f(x) − y||². We also tried max-margin and cosine losses.
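
A hedged sketch of fitting a mapping with the MSE loss, assuming PyTorch and the FeedForwardMap defined above; full-batch Adam, the epoch count and the learning rate are illustrative choices, not the paper's training details:

```python
import torch

def train_map(f, X, Y, epochs=100, lr=1e-3):
    """Fit f to map X onto Y with the loss (1/2)||f(x) - y||^2.

    X, Y: float tensors of shape (N, d_x) and (N, d_y) with paired rows.
    """
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = 0.5 * ((f(X) - Y) ** 2).sum(dim=1).mean()  # MSE averaged over the batch
        loss.backward()
        opt.step()
    return f
```

After training, mNNO_10(X, f(X)) and mNNO_10(Y, f(X)) can be compared with the mnno function sketched earlier.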

Experiment 1: Results

                             ResNet                VGG-128
                        X,f(X)   Y,f(X)       X,f(X)   Y,f(X)
ImageNet     I→T  lin   0.681    0.262        0.723    0.236
                  nn    0.622    0.273        0.682    0.246
             T→I  lin   0.379    0.241        0.339    0.229
                  nn    0.354    0.270        0.326    0.256
IAPR TC-12   I→T  lin   0.358    0.214        0.382    0.163
                  nn    0.336    0.219        0.331    0.180
             T→I  lin   0.480    0.200        0.419    0.167
                  nn    0.413    0.225        0.372    0.182
Wikipedia    I→T  lin   0.235    0.156        0.235    0.143
                  nn    0.269    0.161        0.282    0.148
             T→I  lin   0.574    0.156        0.600    0.148
                  nn    0.521    0.156        0.511    0.151

Table: X,f(X) and Y,f(X) denote mNNO_10(X, f(X)) and mNNO_10(Y, f(X)), respectively.

Experiment 2
Goal: Map X with an untrained net f and compare the performance of X with that of f(X). This ablates from Experiment 1 the learning part and the choices of loss and output vectors.
Experimental Setup
Evaluate vectors on:
(i) Semantic similarity: SemSim, SimLex-999 and SimVerb-3500.
(ii) Relatedness: MEN and WordSim-353.
(iii) Visual similarity: VisSim.
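
A sketch of this evaluation, assuming PyTorch and SciPy; the untrained architecture, output dimension and random seed are illustrative assumptions:

```python
import torch
from scipy.stats import spearmanr

def evaluate_untrained(emb, pairs, ratings, d_out=128, seed=0):
    """Map embeddings with a randomly initialized (never trained) feed-forward net,
    then correlate cosine similarities of the mapped vectors with human ratings.

    emb: dict word -> 1-D float tensor; pairs: list of (w1, w2); ratings: list of floats.
    """
    torch.manual_seed(seed)
    d_in = next(iter(emb.values())).shape[0]
    f = torch.nn.Sequential(
        torch.nn.Linear(d_in, d_out), torch.nn.Tanh(), torch.nn.Linear(d_out, d_out)
    )  # weights stay at their random initialization: no training step is performed
    with torch.no_grad():
        sims = [torch.cosine_similarity(f(emb[w1]), f(emb[w2]), dim=0).item()
                for w1, w2 in pairs]
    return spearmanr(sims, ratings).correlation
```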

Experiment 2: Results

                  WS-353           MEN              SemSim
                  Cos     Eucl     Cos     Eucl     Cos     Eucl
f_nn(GloVe)       0.632   0.634    0.795   0.791    0.750   0.744
f_lin(GloVe)      0.630   0.606    0.798   0.781    0.763   0.712
GloVe             0.632   0.601    0.801   0.782    0.768   0.716
f_nn(ResNet)      0.402   0.408    0.556   0.554    0.512   0.513
f_lin(ResNet)     0.425   0.449    0.566   0.534    0.533   0.514
ResNet            0.423   0.457    0.567   0.535    0.534   0.516

                  VisSim           SimLex           SimVerb
                  Cos     Eucl     Cos     Eucl     Cos     Eucl
f_nn(GloVe)       0.594   0.590    0.369   0.363    0.313   0.301
f_lin(GloVe)      0.602   0.576    0.369   0.341    0.326   0.230
GloVe             0.606   0.580    0.371   0.340    0.320   0.235
f_nn(ResNet)      0.527   0.526    0.405   0.406    0.178   0.169
f_lin(ResNet)     0.541   0.498    0.409   0.404    0.198   0.182
ResNet            0.543   0.501    0.409   0.403    0.211   0.199

Table: Spearman correlations between human ratings and the similarities (cosine or Euclidean) predicted from the embeddings.

Conclusions and Future Work
Conclusions:
The neighborhood structure of f(X) is more similar to that of X than to that of Y.
The neighborhood structure of embeddings is not significantly disrupted by mapping them with an untrained net.
Future Work: How to mitigate the problem?
A discriminator (adversarial) trying to guess whether a sample comes from Y or from f(X).
Incorporate pairwise similarities into the loss function.
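
One way the second idea could look, as a hedged sketch (illustrative only, not something implemented in the paper), assuming PyTorch: add a term that pushes the pairwise cosine similarities within f(X) towards those within Y.

```python
import torch

def pairwise_similarity_loss(fx, y):
    """Illustrative auxiliary loss: match the pairwise cosine-similarity matrix of
    the mapped batch f(X) to that of the target batch Y.

    fx, y: tensors of shape (batch, d_out).
    """
    fx_n = torch.nn.functional.normalize(fx, dim=1)
    y_n = torch.nn.functional.normalize(y, dim=1)
    sim_fx = fx_n @ fx_n.t()  # pairwise cosine similarities among mapped vectors
    sim_y = y_n @ y_n.t()     # pairwise cosine similarities among target vectors
    return ((sim_fx - sim_y) ** 2).mean()
```

Such a term could be added to the MSE objective with a weighting coefficient, which would itself be a hyperparameter to tune.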

Thank you! Questions?