Bilinear Attention Networks
2018 VQA Challenge runner-up (1st single model)
Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang
Biointelligence Lab., Seoul National University
Visual Question Answering
Learning a joint representation of multiple modalities: linguistic and visual information.
Credit: visualqa.org
VQA 2.0 Dataset (Goyal et al., 2017)
204K MS COCO images; 760K questions (10 answers for each question from unique AMT workers).
Splits: train, val, test-dev (remote validation), test-standard (publications), test-challenge (competitions), and test-reserve (check overfitting).

Split  Images  Questions  Answers
Train  80K     443K       4.4M
Val    40K     214K       2.1M
Test   80K     447K       unknown

The VQA Score is the average of the 10-choose-9 accuracies:

# matching annotations  VQA Score
0                       0.0
1                       0.3
2                       0.6
3                       0.9
>3                      1.0

https://github.com/gt-vision-lab/vqa/issues/1
https://github.com/hengyuan-hu/bottom-up-attention-vqa/pull/18
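As a quick illustration of the scoring table above, a hypothetical helper (this is a sketch, not the official evaluation code, which additionally averages over the ten leave-one-out annotator subsets):

```python
def vqa_soft_score(num_matching_humans: int) -> float:
    # min(n/3, 1), quantized to the 0.3 steps in the table above
    return min(num_matching_humans * 3, 10) / 10

assert vqa_soft_score(0) == 0.0
assert vqa_soft_score(2) == 0.6
assert vqa_soft_score(3) == 0.9
assert vqa_soft_score(5) == 1.0  # more than 3 matches saturate at 1.0
```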
Objective
Introducing bilinear attention:
- Interactions between words and visual concepts are meaningful
- Proposing an efficient method (with the same time complexity) on top of low-rank bilinear pooling
Residual learning of attention:
- Residual learning with an attention mechanism for incremental inference
Integration of the counting module (Zhang et al., 2018)
Preliminary
Question embedding (fine-tuned):
- Use all outputs of the GRU (every time step)
- X ∈ R^{N×ρ}, where N is the hidden dimension and ρ = |{x_i}| is the number of tokens
Image embedding (fixed bottom-up-attention):
- Select 10-100 detected objects (rectangles) using a pre-trained Faster R-CNN (1,600 classes, 400 attributes) to extract rich features for each object
- Y ∈ R^{M×φ}, where M is the feature dimension and φ = |{y_j}| is the number of objects
https://github.com/peteanderson80/bottom-up-attention
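A shape-only PyTorch sketch of the two embeddings; all sizes and tensor names here are illustrative assumptions, not the exact configuration:

```python
import torch
import torch.nn as nn

B, rho, phi = 32, 14, 36      # batch, # question tokens, # detected objects
N, M = 1024, 2048             # GRU hidden dim. N, image feature dim. M

# question embedding: keep ALL GRU outputs, not just the last hidden state
gru = nn.GRU(input_size=300, hidden_size=N, batch_first=True)
word_vectors = torch.randn(B, rho, 300)    # e.g., pre-trained word embeddings
X, _ = gru(word_vectors)                   # X: (B, rho, N), fine-tuned

# image embedding: fixed features from pre-trained Faster R-CNN detections
Y = torch.randn(B, phi, M)                 # Y: (B, phi, M), one row per object
```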
Low-rank Bilinear Pooling
Bilinear model and its approximation (Wolf et al., 2007; Pirsiavash et al., 2009). Unitary attention uses a single-channel input (question vector) to combine the other multi-channel input (image features) into a single-channel intermediate representation (attended feature).

Low-rank bilinear model. Pirsiavash et al. proposed a low-rank bilinear model to reduce the rank of the bilinear weight matrix W_i to give regularity. For this, W_i is replaced with the multiplication of two smaller matrices U_i V_i^T, where U_i ∈ R^{N×d} and V_i ∈ R^{M×d}. As a result, this replacement makes the rank of W_i at most d ≤ min(N, M). For the scalar output f_i (bias terms are omitted without loss of generality):

    f_i = x^T W_i y ≈ x^T U_i V_i^T y = 1^T (U_i^T x ∘ V_i^T y)    (1)

where 1 ∈ R^d is a vector of ones and ∘ denotes the Hadamard product (element-wise multiplication).

Low-rank bilinear pooling (Kim et al., 2017). For a vector output f, a pooling matrix P is introduced:

    f = P^T (U^T x ∘ V^T y)    (2)

where P ∈ R^{d×c}, U ∈ R^{N×d}, and V ∈ R^{M×d}. For a vector output f ∈ R^c, replacing the vector of ones with the pooling matrix P allows U and V to be two-dimensional tensors instead of three-dimensional, significantly reducing the number of parameters.

Unitary attention networks. Attention provides an efficient mechanism to reduce the input channels by selectively utilizing given information. Assuming a multi-channel input Y consisting of column vectors {y_i}, we want to get a single channel ŷ from Y using the weights {α_i}:

    ŷ = Σ_i α_i y_i    (3)
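A minimal PyTorch sketch of Eqs. (1) and (2) under assumed sizes (single-channel x and y; biases omitted, matching the slide):

```python
import torch

N, M, d, c = 1024, 2048, 512, 256
x, y = torch.randn(N), torch.randn(M)
U, V, P = torch.randn(N, d), torch.randn(M, d), torch.randn(d, c)

# Eq. (1) with U_i = U, V_i = V: scalar output via the Hadamard product
f_scalar = torch.ones(d) @ ((U.T @ x) * (V.T @ y))

# Eq. (2): the vector of ones is replaced by a pooling matrix P, f in R^c
f = P.T @ ((U.T @ x) * (V.T @ y))
```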
Unitary Attention
This pooling is used to get attention weights from a question embedding (single-channel) vector and visual feature vectors (multi-channel) as the two inputs. We call it unitary attention since the question embedding vector queries the feature vectors unidirectionally.
(Diagram: Q → Linear, Tanh, Replicate; V → Conv, Tanh; combined → Conv, Softmax; attended feature → Linear, Tanh. Kim et al., 2017)
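A sketch of unitary attention with low-rank bilinear pooling producing the weights; names and sizes are assumptions, and tanh stands in for the diagram's activations:

```python
import torch
import torch.nn.functional as F

N, M, d, phi = 1024, 2048, 512, 36
q, Y = torch.randn(N), torch.randn(M, phi)      # question vector, visual columns
U, V, p = torch.randn(N, d), torch.randn(M, d), torch.randn(d)

# one attention logit per visual column, via low-rank bilinear pooling
logits = torch.stack([p @ (torch.tanh(U.T @ q) * torch.tanh(V.T @ Y[:, j]))
                      for j in range(phi)])
alpha = F.softmax(logits, dim=0)                # attention weights {alpha_j}
y_hat = Y @ alpha                               # Eq. (3): attended single channel
```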
Bilinear Attention Maps
U and V are linear embeddings; p is a learnable projection vector. With X ∈ R^{N×ρ}, Y ∈ R^{M×φ}, U ∈ R^{N×K}, and V ∈ R^{M×K}, the ρ×φ attention map is obtained by element-wise multiplication:

    A := softmax(((1 · p^T) ∘ X^T U) V^T Y)
Bilinear Attention Maps
This is exactly the same approach as low-rank bilinear pooling: each entry of the map is

    A_{i,j} = p^T ((U^T x_i) ∘ (V^T y_j))
Bilinear Attention Maps
Multiple bilinear attention maps (glimpses) are acquired with different projection vectors p_g:

    A_g := softmax(((1 · p_g^T) ∘ X^T U) V^T Y)
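A sketch of the multi-glimpse attention maps under assumed sizes; normalizing the softmax over all ρ·φ entries of each map is one plausible choice, stated here as an assumption:

```python
import torch
import torch.nn.functional as F

N, M, K, rho, phi, G = 1024, 2048, 512, 14, 36, 8
X, Y = torch.randn(N, rho), torch.randn(M, phi)
U, V = torch.randn(N, K), torch.randn(M, K)
p = torch.randn(G, K)                    # one projection vector per glimpse g

A = []
for g in range(G):
    # ((1 p_g^T) o X^T U) V^T Y: a rho x phi logit matrix
    logits = ((torch.ones(rho, 1) * p[g]) * (X.T @ U)) @ (V.T @ Y)
    # softmax over all rho*phi entries (an assumption about normalization)
    A.append(F.softmax(logits.flatten(), dim=0).view(rho, phi))
```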
Bilinear Attention Networks
Each entry of the multimodal joint feature is given by the following equation, where k indexes the K-dimensional joint embedding (broadcasting in PyTorch lets you avoid an explicit for-loop over k; see the sketch below):

    f'_k = (X^T U')_k^T A (Y^T V')_k

Broadcasting: tensor operations are automatically repeated at the API level, supported by NumPy, TensorFlow, and PyTorch.
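A sketch of computing all K joint features at once; torch.einsum expresses the broadcasted triple product, and a per-k loop checks the equivalence (sizes are illustrative assumptions):

```python
import torch

N, M, K, rho, phi = 64, 64, 32, 14, 36
X, Y = torch.randn(N, rho), torch.randn(M, phi)
U_, V_ = torch.randn(N, K), torch.randn(M, K)
A = torch.rand(rho, phi)                         # a bilinear attention map

# f'_k = (X^T U')_k^T  A  (Y^T V')_k for every k, without a Python loop
f = torch.einsum('ik,ij,jk->k', X.T @ U_, A, Y.T @ V_)

# explicit per-k loop for comparison
f_loop = torch.stack([(X.T @ U_)[:, k] @ A @ (Y.T @ V_)[:, k] for k in range(K)])
assert torch.allclose(f, f_loop, rtol=1e-3, atol=1e-3)
```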
Bilinear Attention Networks
One can show that this is equivalent to a bilinear attention model where each feature is pooled by a low-rank bilinear approximation, i.e., low-rank bilinear feature learning inside bilinear attention:

    f'_k = Σ_{i=1}^{ρ} Σ_{j=1}^{φ} A_{i,j} (x_i^T U'_k)(V'_k^T y_j)
         = Σ_{i=1}^{ρ} Σ_{j=1}^{φ} A_{i,j} x_i^T (U'_k V'_k^T) y_j

The right-hand side makes the low-rank bilinear pooling explicit. Similarly to MLB (Kim et al., ICLR 2017), activation functions can be applied.
Bilinear Attention Networks
(Architecture diagram: X and Y are each embedded with Linear + ReLU; attention logits via Linear + Softmax; attended representations embedded again with Linear + ReLU and transposed where needed; a final Linear layer feeds the classifier.)
Time Complexity
Assuming M ≥ N > K > φ ≥ ρ, the time complexity of bilinear attention networks is O(KMφ), where K denotes the hidden size, since BAN consists of matrix chain multiplication.
Empirically, BAN takes 284s/epoch, while the unitary attention control takes 190s/epoch. The gap is largely due to the increased input size of the softmax function, from φ to φ × ρ.
Residual Learning of Attention
Residual learning exploits multiple bilinear attention maps (f_0 is X and {α_i} is fixed to ones):

    f_{i+1} = BAN_i(f_i, Y; A_i) · 1^T + f_i

where the first term is the i-th bilinear attention network with its bilinear attention map A_i, the second term is the shortcut, and multiplying by 1^T repeats the pooled vector over the ρ (# of tokens) columns.
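A minimal sketch of the residual stacking loop; ban_step and the random maps stand in for the learned modules, and all names and sizes are assumptions:

```python
import torch

def ban_step(f, Y, A, U_, V_):
    # one BAN glimpse: K-dim joint feature from current channels f and Y
    return torch.einsum('ik,ij,jk->k', f.T @ U_, A, Y.T @ V_)

N, M, rho, phi, G = 64, 64, 14, 36, 8
K = N                                  # K = N so the shortcut type-checks
X, Y = torch.randn(N, rho), torch.randn(M, phi)

f = X                                  # f_0 = X
for i in range(G):
    A_i = torch.rand(rho, phi)         # i-th bilinear attention map
    U_i, V_i = torch.randn(N, K), torch.randn(M, K)
    # f_{i+1} = BAN_i(f_i, Y; A_i) . 1^T + f_i  (repeat over the rho tokens)
    f = ban_step(f, Y, A_i, U_i, V_i).unsqueeze(1) * torch.ones(1, rho) + f
```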
Overview
After getting the bilinear attention maps, we can stack multiple BANs with residual learning.
(Overview diagram — Step 1, Bilinear Attention Maps: the question "What is the mustache made of?" is encoded by a GRU keeping all hidden states, X ∈ R^{N×ρ}; object detection gives Y ∈ R^{M×φ}; U^T X and Y^T V produce attention maps att_1, att_2 via softmax. Step 2, Bilinear Attention Networks: residual learning with K = N, repeated over the ρ tokens, followed by sum pooling and an MLP classifier.)
Multiple Attention Maps
Single-model validation scores:

Model                           Validation VQA 2.0 Score   Δ
Bottom-Up (Teney et al., 2017)  63.37
BAN-1                           65.36                      +1.99
BAN-2                           65.61                      +0.25
BAN-4                           65.81                      +0.20
BAN-8                           66.00                      +0.19
BAN-12                          66.04                      +0.04
Residual Learning

Model             Validation VQA 2.0 Score   Δ
BAN-4 (Residual)  65.81 ±0.09
BAN-4 (Sum)       64.78 ±0.08                −1.03
BAN-4 (Concat)    64.71 ±0.21                −0.07

Sum replaces the residual connections with Σ_i BAN_i(X, Y; A_i); Concat concatenates the BAN_i(X, Y; A_i) outputs instead.
Comparison with Co-attention

Model (BAN-1 size)   Validation VQA 2.0 Score   Δ
Bilinear Attention   65.36 ±0.14
Co-Attention         64.79 ±0.06                −0.57
Unitary Attention    64.59 ±0.04                −0.20

The number of parameters is controlled (all comparison models have 32M parameters).
Comparison with Co-attention
(Plots — (a) training and (b) validation curves over 20 epochs for bilinear attention (Bi-Att), co-attention (Co-Att), and unitary attention (Uni-Att); (c) validation score vs. number of parameters (0M-60M) for Uni-Att, Co-Att, BAN-1, BAN-4, and BAN-1+MFB; validation curves and used glimpses for BAN-2, BAN-4, BAN-8, and BAN-12.)
Visualization
Integration of the Counting Module with BAN
The counter (Zhang et al., 2018) takes a multinoulli distribution over detected boxes together with the box coordinates. Our method applies maxout over the bilinear attention distribution to obtain a unitary attention distribution, which makes the counter easy to adapt for multitask learning:

    f_{i+1} = (BAN_i(f_i, Y; A_i) + g_i(c_i)) · 1^T + f_i

where g_i is a linear embedding of the i-th output c_i of the counter module, the inference steps use multi-glimpse attention, and the shortcut follows the residual learning of attention.
(Diagram: the ρ×φ bilinear attention logits go through softmax for BAN, and through maxout over ρ and a sigmoid for the counting module.)
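A sketch of the maxout bridge from the bilinear map to a unitary, per-object distribution for the counter; tensor names are assumptions, and the counter itself is the module of Zhang et al. (2018):

```python
import torch

rho, phi = 14, 36
bilinear_logits = torch.randn(rho, phi)       # attention logits before softmax

# maxout over the question-token axis: rho x phi -> phi unitary logits
unitary_logits = bilinear_logits.max(dim=0).values
obj_weights = torch.sigmoid(unitary_logits)   # per-object input to the counter
```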
Comparison with State-of-the-Art
Counting questions (test-dev): Zhang et al. (2018) 51.62 → Ours 54.04.

Single model on test-dev:

Model                              Test-dev VQA 2.0 Score   Δ
Prior                              25.70
Language-Only                      44.22                    +18.52
MCB (ResNet; 2016 winner)          61.96                    +17.74
Bottom-Up (FRCNN; 2017 winner)     65.32                    +3.36
MFH (ResNet; 2017 runner-up)       65.80                    +0.48
MFH (FRCNN)                        68.76                    +2.96   (image feature)
BAN (Ours; FRCNN)                  69.52                    +0.76   (attention model)
BAN-Glove (Ours; FRCNN)            69.66                    +0.14
BAN-Glove-Counter (Ours; FRCNN)    70.04                    +0.38   (counting feature)
Flickr30k Entities
Visual grounding task: mapping entity phrases to regions in an image.
[/EN#40120/people A girl] in [/EN#40122/clothing a yellow tennis suit], [/EN#40125/other green visor] and [/EN#40128/clothing white tennis shoes] holding [/EN#40124/other a tennis racket] in a position where she is going to hit [/EN#40121/other the tennis ball].
(Figure annotation: one predicted box has IoU 0.48 < 0.5.)
Flickr30k Entities
Visual grounding task: mapping entity phrases to regions in an image.
[/EN#38656/people A male conductor] wearing [/EN#38657/clothing all black] leading [/EN#38653/people a orchestra] and [/EN#38658/people choir] on [/EN#38659/scene a brown stage] playing and singing [/EN#38664/other a musical number].
Flickr30k Entities
Recall@1, 5, 10:

Method                 R@1    R@5    R@10
Zhang et al. (2016)    27.0   49.9   57.7
SCRC (2016)            27.8   -      62.9
DSPE (2016)            43.89  64.46  68.66
GroundeR (2016)        47.81  -      -
MCB (2016)             48.69  -      -
CCA (2017)             50.89  71.09  75.73
Yeh et al. (2017)      53.97  -      -
Hinami & Satoh (2017)  65.21  -      -
BAN (ours)             66.69  84.22  86.35
Flickr30k Entities
Recall@1 by entity type:

Method                 people  clothing  bodyparts  animals  vehicles  instruments  scene  other
#Instances             5,656   2,306     523        518      400       162          1,619  3,374
GroundeR (2016)        61.00   38.12     10.33      62.55    68.75     36.42        58.18  29.08
CCA (2017)             64.73   46.88     17.21      65.83    68.75     37.65        51.39  31.77
Yeh et al. (2017)      68.71   46.83     19.50      70.07    73.75     39.50        60.38  32.45
Hinami & Satoh (2017)  78.17   61.99     35.25      74.41    76.16     56.69        68.07  47.42
BAN (ours)             79.90   74.95     47.23      81.85    76.92     43.00        68.69  51.33
Conclusions
Bilinear attention networks gracefully extend unitary attention networks, with low-rank bilinear pooling inside bilinear attention. Furthermore, residual learning of attention efficiently exploits multiple attention maps.
2018 VQA Challenge runner-up (shared 2nd place); 1st single model (70.35 on test-standard).
Joint Runner-Up Team: SNU-BI — Jin-Hwa Kim (Seoul National University), Jaehyun Jun (Seoul National University), Byoung-Tak Zhang (Seoul National University & Surromind Robotics). Challenge accuracy: 71.69.
arXiv:1805.07932. Code & pretrained model in PyTorch: github.com/jnhwkim/ban-vqa
Thank you! Any questions?
We would like to thank Kyoung-Woon On, Bohyung Han, and Hyeonwoo Noh for helpful comments and discussion. This work was supported by the 2017 Google Ph.D. Fellowship in Machine Learning, the Ph.D. Completion Scholarship from the College of Humanities, Seoul National University, and the Korea government (IITP-2017-0-01772-VTT, IITP-R0126-16-1072-SW.StarLab, 2018-0-00622-RMI, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF). Part of the computing resources used in this study was generously shared by Standigm Inc.