Bilinear Attention Networks


Bilinear Attention Networks
2018 VQA Challenge runner-up (1st single model)
Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang
Biointelligence Lab., Seoul National University

Visual Question Answering
Learning a joint representation of multiple modalities: linguistic and visual information. Credit: visualqa.org

VQA 2.0 Dataset
204K MS COCO images (Goyal et al., 2017).

  Split   Images   Questions   Answers
  Train   80K      443K        4.4M
  Val     40K      214K        2.1M
  Test    80K      447K        unknown

760K questions (10 answers for each question from unique AMT workers).
Splits: train, val, test-dev (remote validation), test-standard (publications), test-challenge (competitions), and test-reserve (check overfitting).

VQA Score is the average of the 10 choose-9 accuracies:

  # matching annotations   VQA Score
  0                        0.0
  1                        0.3
  2                        0.6
  3                        0.9
  >3                       1.0

https://github.com/gt-vision-lab/vqa/issues/1
https://github.com/hengyuan-hu/bottom-up-attention-vqa/pull/18
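As a minimal sketch (not the official evaluation script), the VQA score of a predicted answer against the 10 human answers is the average of the 10 leave-one-out accuracies min(#matches / 3, 1), which reproduces the table above:

```python
# Minimal sketch of the VQA accuracy metric (not the official evaluation code).
# score = average over the 10 "leave one annotator out" subsets of min(#matches / 3, 1).
def vqa_score(predicted: str, human_answers: list) -> float:
    assert len(human_answers) == 10
    scores = []
    for i in range(10):
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(a == predicted for a in others)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / 10.0

# e.g. 2 of 10 annotators gave "red" -> score 0.6, matching the table above
print(vqa_score("red", ["red", "red"] + ["blue"] * 8))  # 0.6
```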

Objective
- Introducing bilinear attention: interactions between words and visual concepts are meaningful; we propose an efficient method (with the same time complexity) on top of low-rank bilinear pooling.
- Residual learning of attention: residual learning with an attention mechanism for incremental inference.
- Integration of the counting module (Zhang et al., 2018).

Preliminary
Question embedding (fine-tuned):
- Use all the outputs of the GRU (every time step).
- $X \in \mathbb{R}^{N \times \rho}$, where $N$ is the hidden dimension and $\rho = |\{x_i\}|$ is the number of tokens.
Image embedding (fixed bottom-up-attention):
- Select 10-100 detected objects (rectangles) using a pre-trained Faster R-CNN to extract rich features for each object (1,600 classes, 400 attributes). (https://github.com/peteanderson80/bottom-up-attention)
- $Y \in \mathbb{R}^{M \times \phi}$, where $M$ is the feature dimension and $\phi = |\{y_j\}|$ is the number of objects.
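A minimal PyTorch sketch of the two embeddings' shapes; the sizes are illustrative, and the object features are assumed to be precomputed by the bottom-up-attention detector and loaded from disk:

```python
import torch
import torch.nn as nn

# Question: keep ALL GRU hidden states (one per token), not just the last one.
vocab_size, emb_dim, N = 20000, 300, 1024            # illustrative sizes
word_emb = nn.Embedding(vocab_size, emb_dim)
gru = nn.GRU(emb_dim, N, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 14))       # rho = 14 tokens
X, _ = gru(word_emb(tokens))                         # (1, rho, N): every time step is kept

# Image: 10-100 object features from a pre-trained Faster R-CNN (fixed, not fine-tuned).
M, phi = 2048, 36                                    # feature dim and # of detected objects
Y = torch.randn(1, phi, M)                           # placeholder for precomputed features
```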

Low-rank Bilinear Pooling
- Uses a single-channel input (question vector) to combine the other multi-channel input (image features) as a single-channel intermediate representation (attended feature).

Bilinear model and its approximation (Wolf et al., 2007; Pirsiavash et al., 2009).
Low-rank bilinear model. Pirsiavash et al. proposed a low-rank bilinear model to reduce the rank of the bilinear weight matrix $W_i$ to give regularity. For this, $W_i$ is replaced with the multiplication of two smaller matrices $U_i V_i^T$, where $U_i \in \mathbb{R}^{N \times d}$ and $V_i \in \mathbb{R}^{M \times d}$. As a result, this replacement makes the rank of $W_i$ at most $d \le \min(N, M)$. For the scalar output $f_i$ (bias terms are omitted without loss of generality):

$$f_i = x^T W_i y = x^T U_i V_i^T y = \mathbb{1}^T (U_i^T x \circ V_i^T y) \quad (1)$$

where $\mathbb{1} \in \mathbb{R}^d$ is a vector of ones and $\circ$ denotes the Hadamard product (element-wise multiplication).

Low-rank bilinear pooling (Kim et al., 2017). For a vector output $f$, a pooling matrix $P$ is introduced:

$$f = P^T (U^T x \circ V^T y) \quad (2)$$

where $P \in \mathbb{R}^{d \times c}$, $U \in \mathbb{R}^{N \times d}$, and $V \in \mathbb{R}^{M \times d}$. It allows $U$ and $V$ to be two-dimensional tensors instead of three-dimensional ones by introducing $P$ for a vector output $f \in \mathbb{R}^c$, significantly reducing the number of parameters. For a vector output, replace the vector of ones with a pooling matrix $P$ (use three two-dimensional tensors).

Unitary attention networks. Attention provides an efficient mechanism to reduce the input channels by selectively utilizing the given information. Assuming a multi-channel input $Y$ consisting of column vectors $\{y_i\}$, we want to get a single channel $\hat{y}$ from $Y$ using the weights $\{\alpha_i\}$:

$$\hat{y} = \sum_i \alpha_i y_i \quad (3)$$
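A minimal PyTorch sketch of Eq. (1) and Eq. (2) for a single (x, y) pair; all sizes are illustrative and the MLB-style nonlinearities are omitted:

```python
import torch

N, M, d, c = 1024, 2048, 512, 1024   # illustrative sizes
U = torch.randn(N, d)
V = torch.randn(M, d)
P = torch.randn(d, c)

x = torch.randn(N)                   # question vector
y = torch.randn(M)                   # visual feature vector

# Eq. (1): scalar low-rank bilinear form, 1^T (U^T x  *  V^T y)
f_scalar = ((U.t() @ x) * (V.t() @ y)).sum()

# Eq. (2): vector output via the pooling matrix P
f = P.t() @ ((U.t() @ x) * (V.t() @ y))   # shape (c,)
```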

Unitary Attention
This pooling is used to get the attention weights, with a question embedding vector (single-channel) and visual feature vectors (multi-channel) as the two inputs. We call it unitary attention since the question embedding vector queries the feature vectors unidirectionally.
(Diagram, Kim et al., 2017: Q passes through Linear + Tanh and is replicated; V passes through Conv + Tanh; their combination goes through a softmax to give the attention A, followed by further Linear + Tanh layers.)
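A minimal sketch of unitary attention built from the same low-rank pooling: the question vector x scores every column of Y, a softmax gives the weights α, and Eq. (3) produces the attended feature. Sizes and the single projection p are illustrative; the actual model of Kim et al. (2017) adds nonlinearities and multiple glimpses:

```python
import torch
import torch.nn.functional as F

N, M, d, phi = 1024, 2048, 512, 36       # illustrative sizes
U, V, p = torch.randn(N, d), torch.randn(M, d), torch.randn(d)

x = torch.randn(N)                       # single-channel question embedding
Y = torch.randn(M, phi)                  # multi-channel visual features (columns y_j)

# low-rank bilinear score between x and every y_j, then softmax over the objects
scores = p @ ((U.t() @ x).unsqueeze(1) * (V.t() @ Y))   # (phi,)
alpha = F.softmax(scores, dim=0)
y_hat = Y @ alpha                        # Eq. (3): attended single-channel feature, shape (M,)
```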

Bilinear Attention Maps
$U$ and $V$ are linear embeddings, $p$ is a learnable projection vector, and $\circ$ is element-wise multiplication:

$$A := \mathrm{softmax}\big(((\mathbb{1} \cdot p^T) \circ X^T U)\, V^T Y\big) \in \mathbb{R}^{\rho \times \phi}$$

Bilinear Attention Maps
Exactly the same approach as low-rank bilinear pooling:

$$A := \mathrm{softmax}\big(((\mathbb{1} \cdot p^T) \circ X^T U)\, V^T Y\big), \qquad A_{i,j} = p^T \big((U^T X_i) \circ (V^T Y_j)\big).$$
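A minimal sketch of this attention-map computation; consistent with the time-complexity slide later (softmax input size φ·ρ), the softmax is taken over all ρ·φ logits. Sizes are illustrative:

```python
import torch
import torch.nn.functional as F

N, M, K, rho, phi = 1024, 2048, 512, 14, 36     # illustrative sizes
U, V, p = torch.randn(N, K), torch.randn(M, K), torch.randn(K)

X = torch.randn(N, rho)   # question: one column per token
Y = torch.randn(M, phi)   # image: one column per detected object

# logits_{i,j} = p^T ((U^T x_i) * (V^T y_j)), written as ((1 p^T) * X^T U) V^T Y
logits = ((torch.ones(rho, 1) @ p.unsqueeze(0)) * (X.t() @ U)) @ (V.t() @ Y)   # (rho, phi)
A = F.softmax(logits.view(-1), dim=0).view(rho, phi)    # softmax over all rho*phi entries

# element-wise check of A_{i,j} = p^T ((U^T x_i) * (V^T y_j)) before the softmax
i, j = 0, 0
assert torch.allclose(logits[i, j], p @ ((U.t() @ X[:, i]) * (V.t() @ Y[:, j])), rtol=1e-3)
```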

Bilinear Attention Maps
Multiple bilinear attention maps are acquired by different projection vectors $p_g$ as:

$$A_g := \mathrm{softmax}\big(((\mathbb{1} \cdot p_g^T) \circ X^T U)\, V^T Y\big)$$

Bilinear Attention Networks
Each element of the multimodal joint feature is computed with the following equation ($k$ is the index over $K$; broadcasting in PyTorch lets you avoid a for-loop over $k$):

$$f'_k = (X^T U')_k^T\, A\, (Y^T V')_k$$

Broadcasting: automatically repeated tensor operations at the API level, supported by NumPy, TensorFlow, and PyTorch.
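A minimal sketch of this joint-feature computation for all k at once; torch.einsum stands in for the broadcasting trick mentioned above, and the explicit loop is kept only for comparison (sizes are illustrative):

```python
import torch

N, M, K, rho, phi = 1024, 2048, 512, 14, 36
X, Y = torch.randn(N, rho), torch.randn(M, phi)
U_, V_ = torch.randn(N, K), torch.randn(M, K)                  # U', V' in the slide
A = torch.softmax(torch.randn(rho * phi), 0).view(rho, phi)    # a bilinear attention map

XU = X.t() @ U_          # (rho, K), columns (X^T U')_k
YV = Y.t() @ V_          # (phi, K), columns (Y^T V')_k

# f'_k = (X^T U')_k^T  A  (Y^T V')_k for every k, without an explicit for-loop over K
f = torch.einsum('ik,ij,jk->k', XU, A, YV)                     # shape (K,)

# the same computation with the loop, for comparison
f_loop = torch.stack([XU[:, k] @ A @ YV[:, k] for k in range(K)])
assert torch.allclose(f, f_loop, rtol=1e-3)
```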

Bilinear Attention Networks
One can show that this is equivalent to a bilinear attention model where each feature is pooled by a low-rank bilinear approximation:

$$f'_k = (X^T U')_k^T\, A\, (Y^T V')_k = \sum_{i=1}^{\rho} \sum_{j=1}^{\phi} A_{i,j}\, (X_i^T U'_k)(V_k'^T Y_j) = \sum_{i=1}^{\rho} \sum_{j=1}^{\phi} A_{i,j}\, X_i^T (U'_k V_k'^T)\, Y_j$$

(low-rank bilinear pooling)

Bilinear Attention Networks
One can show that this is equivalent to a bilinear attention model where each feature is pooled by a low-rank bilinear approximation: low-rank bilinear feature learning inside bilinear attention.

$$f'_k = \sum_{i=1}^{\rho} \sum_{j=1}^{\phi} A_{i,j}\, (X_i^T U'_k)(V_k'^T Y_j) = \sum_{i=1}^{\rho} \sum_{j=1}^{\phi} A_{i,j}\, X_i^T (U'_k V_k'^T)\, Y_j$$

(low-rank bilinear pooling)
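The equivalence can be checked numerically in a few lines; this is a small self-contained sanity check with illustrative sizes, not the model code:

```python
import torch

N, M, K, rho, phi = 64, 96, 32, 5, 7     # small illustrative sizes for a quick check
X, Y = torch.randn(N, rho), torch.randn(M, phi)
U_, V_ = torch.randn(N, K), torch.randn(M, K)
A = torch.softmax(torch.randn(rho * phi), 0).view(rho, phi)

k = 0
# matrix form: (X^T U')_k^T A (Y^T V')_k
lhs = (X.t() @ U_)[:, k] @ A @ (Y.t() @ V_)[:, k]

# summed form: sum_ij A_ij x_i^T (U'_k V'_k^T) y_j, with a rank-1 weight matrix per k
W_k = U_[:, k:k + 1] @ V_[:, k:k + 1].t()            # (N, M), rank 1
rhs = sum(A[i, j] * (X[:, i] @ W_k @ Y[:, j])
          for i in range(rho) for j in range(phi))

assert torch.allclose(lhs, rhs, rtol=1e-3)
```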

Bilinear Attention Networks
- One can show that this is equivalent to a bilinear attention model where each feature is pooled by a low-rank bilinear approximation.
- Low-rank bilinear feature learning inside bilinear attention.
- Similarly to MLB (Kim et al., ICLR 2017), activation functions can be applied.

Bilinear Attention Networks
(Architecture diagram: X and Y pass through Linear + ReLU embeddings; a Linear + Softmax layer produces the attention; further Linear + ReLU layers and a final Linear classifier follow.)

Time Complexity
Assuming that $M \ge N > K > \phi \ge \rho$, the time complexity of bilinear attention networks is $O(KM\phi)$, where $K$ denotes the hidden size, since BAN consists of matrix chain multiplications.
Empirically, BAN takes 284 s/epoch while the unitary attention control takes 190 s/epoch, largely due to the increased input size of the softmax function, from $\phi$ to $\phi \times \rho$.

Residual Learning of Attention
Residual learning exploits multiple attention maps ($f_0$ is $X$ and $\{\alpha_i\}$ is fixed to ones):

$$f_{i+1} = \mathrm{BAN}_i(f_i, Y; A_i) \cdot \mathbb{1}^T + f_i$$

where $A_i$ is a bilinear attention map, $\mathrm{BAN}_i$ is a bilinear attention network, $f_i$ is the shortcut, and $\cdot\, \mathbb{1}^T$ repeats the joint feature # of tokens times.
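A minimal sketch of the residual stacking over G glimpses; ban_step stands in for BAN_i and only computes the joint feature from the previous sketches, with K = N (as in the overview slide) so the shortcut addition type-checks. All names and sizes are illustrative:

```python
import torch

def ban_step(f, Y, A, U_, V_):
    """One BAN_i(f, Y; A_i): joint feature from the current question representation f and Y."""
    XU = f.t() @ U_                        # (rho, K)
    YV = Y.t() @ V_                        # (phi, K)
    return torch.einsum('ik,ij,jk->k', XU, A, YV)   # (K,)

N, M, K, rho, phi, G = 1024, 2048, 1024, 14, 36, 4   # K = N so the residual sum type-checks
f = torch.randn(N, rho)                    # f_0 = X
Y = torch.randn(M, phi)

for i in range(G):                         # G attention maps / glimpses
    A_i = torch.softmax(torch.randn(rho * phi), 0).view(rho, phi)
    U_i, V_i = torch.randn(N, K), torch.randn(M, K)
    # f_{i+1} = BAN_i(f_i, Y; A_i) · 1^T + f_i  (the K-dim vector is repeated over the rho tokens)
    f = ban_step(f, Y, A_i, U_i, V_i).unsqueeze(1) @ torch.ones(1, rho) + f
```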

Overview
After getting the bilinear attention maps, we can stack multiple BANs with residual learning.
(Overview diagram. Step 1, Bilinear Attention Maps: the question, e.g. "What is the mustache made of?", is encoded by a GRU keeping all hidden states as $X$; object detection gives $Y$; $U^T X$, $Y^T V$, and the projection $p$ produce the attention maps att_1, att_2 via softmax. Step 2, Bilinear Attention Networks: with $K = N$, residual learning adds each glimpse back to $X$ (repeated over the $\rho$ tokens), followed by sum pooling and an MLP classifier.)

Multiple Attention Maps
Single-model scores on the VQA 2.0 validation split:

  Model                            Score    +%
  Bottom-Up (Teney et al., 2017)   63.37
  BAN-1                            65.36    +1.99
  BAN-2                            65.61    +0.25
  BAN-4                            65.81    +0.20
  BAN-8                            66.00    +0.19
  BAN-12                           66.04    +0.04

Residual Learning
VQA 2.0 validation scores for different ways of combining the glimpses $\mathrm{BAN}_i(X, Y; A_i)$:

  Model              Score          +/-
  BAN-4 (Residual)   65.81 ±0.09
  BAN-4 (Sum)        64.78 ±0.08    -1.03
  BAN-4 (Concat)     64.71 ±0.21    -0.07

Comparison with Co-attention
VQA 2.0 validation scores for BAN-1 variants:

  Attention mechanism   Score          +/-
  Bilinear Attention    65.36 ±0.14
  Co-Attention          64.79 ±0.06   -0.57
  Attention             64.59 ±0.04   -0.20

The number of parameters is controlled (all comparison models have 32M parameters).

Comparison with Co-attention
(Figure: (a) training and validation curves (validation score vs. epoch) for bilinear, co-, and unitary attention; (b) validation score vs. the number of parameters (0M-60M) for Uni-Att, Co-Att, BAN-1, BAN-4, and BAN-1+MFB; (c) validation score vs. the number of used glimpses for BAN-2, BAN-4, BAN-8, and BAN-12.)
The number of parameters is controlled (all comparison models have 32M parameters).

Visualization

Integration of the Counting Module with BAN
The counter (Zhang et al., 2018) takes a multinoulli distribution over objects and the detected box info. Our method maxouts the bilinear attention distribution into a unitary attention distribution, making it easy to adapt to multitask learning.
(Diagram: bilinear attention logits ($\rho \times \phi$) → maxout → sigmoid → counting module.)

$$f_{i+1} = \big(\mathrm{BAN}_i(f_i, Y; A_i) + g_i(c_i)\big) \cdot \mathbb{1}^T + f_i$$

where $g_i(c_i)$ is a linear embedding of the $i$-th output of the counter module, $i$ indexes the inference steps of multi-glimpse attention, and $+\, f_i$ is the residual learning of attention.
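A minimal sketch of the maxout step only (the counter itself is the module of Zhang et al., 2018 and is not reproduced here); reading the diagram, I assume the max is taken over the question dimension so that each detected object gets one attention value, which is then passed through a sigmoid for the counter:

```python
import torch

rho, phi = 14, 36
bilinear_logits = torch.randn(rho, phi)           # attention logits over (token, object) pairs

# maxout over the question dimension -> one logit per detected object (assumed reading)
unitary_logits, _ = bilinear_logits.max(dim=0)    # (phi,)
object_attention = torch.sigmoid(unitary_logits)  # the counter of Zhang et al. (2018)
                                                  # consumes per-object attention plus box info
```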

Comparison with State-of-the-arts
"Number" questions on test-dev:

  Model                 Score
  Zhang et al. (2018)   51.62
  Ours                  54.04

Single-model scores on test-dev (VQA 2.0):

  Model                                 Score    +%
  Prior                                 25.70
  Language-Only                         44.22    +18.52
  MCB (ResNet)  [2016 winner]           61.96    +17.74
  Bottom-Up (FRCNN)  [2017 winner]      65.32    +3.36
  MFH (ResNet)  [2017 runner-up]        65.80    +0.48
  MFH (FRCNN)                           68.76    +2.96
  BAN (Ours; FRCNN)                     69.52    +0.76
  BAN-Glove (Ours; FRCNN)               69.66    +0.14
  BAN-Glove-Counter (Ours; FRCNN)       70.04    +0.38

(FRCNN: image feature; BAN: attention model; Counter: counting feature.)

Flickr30k Entities
Visual grounding task: mapping entity phrases to regions in an image.
Example (one predicted box scores 0.48 < 0.5, below the IoU threshold):
[/EN#40120/people A girl] in [/EN#40122/clothing a yellow tennis suit], [/EN#40125/other green visor] and [/EN#40128/clothing white tennis shoes] holding [/EN#40124/other a tennis racket] in a position where she is going to hit [/EN#40121/other the tennis ball].

Flickr30k Entities
Visual grounding task: mapping entity phrases to regions in an image.
Example:
[/EN#38656/people A male conductor] wearing [/EN#38657/clothing all black] leading [/EN#38653/people a orchestra] and [/EN#38658/people choir] on [/EN#38659/scene a brown stage] playing and singing [/EN#38664/other a musical number].

Flickr30k Entities: Recall@1, 5, 10

  Model                   R@1     R@5     R@10
  Zhang et al. (2016)     27.0    49.9    57.7
  SCRC (2016)             27.8    -       62.9
  DSPE (2016)             43.89   64.46   68.66
  GroundeR (2016)         47.81   -       -
  MCB (2016)              48.69   -       -
  CCA (2017)              50.89   71.09   75.73
  Yeh et al. (2017)       53.97   -       -
  Hinami & Satoh (2017)   65.21   -       -
  BAN (ours)              66.69   84.22   86.35

Flickr30k Entities: per-category recall

  Category (#instances)   GroundeR (2016)   CCA (2017)   Yeh et al. (2017)   Hinami & Satoh (2017)   BAN (ours)
  people (5,656)          61.00             64.73        68.71               78.17                   79.90
  clothing (2,306)        38.12             46.88        46.83               61.99                   74.95
  bodyparts (523)         10.33             17.21        19.50               35.25                   47.23
  animals (518)           62.55             65.83        70.07               74.41                   81.85
  vehicles (400)          68.75             68.75        73.75               76.16                   76.92
  instruments (162)       36.42             37.65        39.50               56.69                   43.00
  scene (1,619)           58.18             51.39        60.38               68.07                   68.69
  other (3,374)           29.08             31.77        32.45               47.42                   51.33

Conclusions
Bilinear attention networks gracefully extend unitary attention networks, using low-rank bilinear pooling inside the bilinear attention. Furthermore, residual learning of attention efficiently uses multiple attention maps.
2018 VQA Challenge runner-up (shared 2nd place); 1st single model (70.35 on test-standard).
Challenge Joint Runner-Up Team SNU-BI: Jin-Hwa Kim (Seoul National University), Jaehyun Jun (Seoul National University), Byoung-Tak Zhang (Seoul National University & Surromind Robotics). Challenge Accuracy: 71.69.

arxiv:1805.07932. Code & pretrained model in PyTorch: github.com/jnhwkim/ban-vqa
Thank you! Any questions?
Acknowledgments: we would like to thank Kyoung-Woon On, Bohyung Han, and Hyeonwoo Noh for helpful comments and discussion. This work was supported by the 2017 Google Ph.D. Fellowship in Machine Learning, the Ph.D. Completion Scholarship from the College of Humanities, Seoul National University, and the Korea government (IITP-2017-0-01772-VTT, IITP-R0126-16-1072-SW.StarLab, 2018-0-00622-RMI, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF). Part of the computing resources used in this study was generously shared by Standigm Inc.