Laconic: Label Consistency for Image Categorization

1 Laconic: Label Consistency for Image Categorization. Samy Bengio, Google, with Jeff Dean, Eugene Ie, Dumitru Erhan, Quoc Le, Andrew Rabinovich, Jon Shlens, and Yoram Singer

2 Motivation: What is the occluded object?

5 Speech Recognition Success. Modern speech recognition systems consist of two building blocks: an acoustic model (what is the most likely phoneme at this temporal window?) and a language model (what is the most likely sequence of words that was uttered?). Both models are combined in the decoding phase. Decoding is efficient because speech is approximately Markov.

6 Language Models for Images? Can we apply a similar language modeling approach to image recognition? This is more challenging, since it is more difficult to impose lexical structure on an image. We simplify and consider a co-occurrence model: a cobra on a desk is a rare event. We use the web to collect a lot of co-occurrence statistics.

7 Related Work. Plenty of work uses ontologies such as WordNet [Marszalek et al., CVPR 2007; Weston et al., NIPS 2012; Deng et al., NIPS 2011, ...]. Other co-occurrence models (see for instance [Rabinovich et al., ICCV 2007], [Torralba et al., NIPS 2005], [Ladicky et al., ECCV 2007], etc.) do not use a large separate source for the language model, and they have no efficient optimization over the potential subsets of labels: they often just try all of them, which doesn't scale. Other approaches use various graphical models, which are often not scalable. Here we focus on an efficient, scalable solution with a large separate source for the language model.

8 The Laconic Model: Label Consistency for Image Categorization. We look for a small set of labels that best describe an image and are consistent with each other. "Best describe an image": use a patch classifier (we use a deep learning approach). "Consistent with each other": use web co-occurrence statistics.

9 Co-Occurrence Model. Harvest all web pages and gather statistics about appearances of terms describing classes:

$$s(i, j) = \log \frac{P(i, j)}{P(i)\,P(j)}$$

Two terms count as a co-occurrence if they appear near each other (say, within K words) on a web page. The value of s(i, j) is unbounded: positive values imply correlation, negative values imply anti-correlation. It is mapped to

$$S_{ij} = \begin{cases} \dfrac{1}{1 + \exp(-s(i, j))} & \text{if } s(i, j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

The result is bounded between 0 and 1.
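The two formulas on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `pmi` and `consistency` and the probabilities below are made up for the example and do not come from the talk.

```python
import math


def pmi(p_ij, p_i, p_j):
    """Co-occurrence score s(i, j) = log(P(i, j) / (P(i) * P(j)))."""
    return math.log(p_ij / (p_i * p_j))


def consistency(s):
    """Map s(i, j) to S_ij: a sigmoid for positive scores, 0 otherwise."""
    return 1.0 / (1.0 + math.exp(-s)) if s > 0 else 0.0


# Hypothetical probabilities, chosen only to show both branches.
s_pos = pmi(p_ij=0.02, p_i=0.1, p_j=0.1)    # co-occur twice as often as chance
s_neg = pmi(p_ij=0.001, p_i=0.1, p_j=0.1)   # co-occur less often than chance

S_pos = consistency(s_pos)   # sigmoid(log 2) = 2/3
S_neg = consistency(s_neg)   # negative score, so clipped to 0
```

Note that, under this mapping, any pair with a positive score lands between 0.5 and 1, while anti-correlated pairs are all set to 0.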

11 Label Voting Model. We use an already trained model (Google's deep learning model) to cast a score for each label at various positions and scales of an image. To simplify, we aggregate them:

$$\mu_i = \frac{\sum_{p \in P} w(p)\, v(i, p)}{\sum_{p \in P} w(p)}$$

where w(p) is the weight of patch p and v(i, p) is the DeepNet score for the label-patch pair (i, p). Patch weights can take into account position and size.
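As a minimal sketch of this aggregation, assuming the patch scores are already available as plain lists (the name `aggregate_votes` and the toy numbers are hypothetical, not from the talk):

```python
def aggregate_votes(scores, weights):
    """Weighted average of per-patch label scores.

    scores[p][i] plays the role of v(i, p), weights[p] the role of w(p).
    Returns the list mu with one aggregated vote per label.
    """
    total_weight = sum(weights)
    n_labels = len(scores[0])
    return [
        sum(w * patch[i] for w, patch in zip(weights, scores)) / total_weight
        for i in range(n_labels)
    ]


# Two patches, two labels; the first patch is weighted three times as much
# (for example because it is larger or more central).
scores = [[0.9, 0.1],
          [0.5, 0.5]]
weights = [3.0, 1.0]
mu = aggregate_votes(scores, weights)   # approximately [0.8, 0.2]
```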

12 Inference as an Optimization Problem. We want to jointly maximize consistency and label votes, under domain constraints, written as a minimization:

$$\min_{\alpha \in \Omega} \; Q(\alpha \mid \mu, S) = E(\alpha \mid \mu) + \lambda\, C(\alpha \mid S) + \epsilon\, R(\alpha)$$

$$E(\alpha \mid \mu) = \begin{cases} -\mu^T \alpha & \text{[linear]} \\ \sum_j \alpha_j \log \dfrac{\alpha_j}{\mu_j} & \text{[relative entropy]} \end{cases}$$

$$C(\alpha \mid S) = \begin{cases} -\alpha^T S \alpha & \text{[Ising]} \\ \sum_{i,j} S_{ij} \begin{cases} d_{ij}^2 / 2 & d_{ij} \le 1 \\ d_{ij} - 1/2 & d_{ij} > 1 \end{cases} & \text{[Huber]} \end{cases} \qquad \text{where } d_{ij} = |\alpha_i - \alpha_j|$$

$$R(\alpha) = \alpha^T \alpha$$

$$\Omega = \Big\{ \alpha \;:\; \sum_j \alpha_j \le N,\ \|\alpha\|_\infty \le 1,\ \forall j: \alpha_j \ge 0 \Big\}$$
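To make the pieces of the objective concrete, here is a small sketch that evaluates Q for a candidate label vector. It assumes the linear vote term is -mu·alpha, the Ising term is -alpha·S·alpha, the Huber term penalizes |alpha_i - alpha_j| weighted by S_ij, and the regularizer is alpha·alpha. The trade-off weights `lam` and `eps`, the function names, and the toy data are all assumptions made for this example.

```python
import math


def e_linear(alpha, mu):
    """Linear vote term: -mu^T alpha."""
    return -sum(m * a for m, a in zip(mu, alpha))


def e_relative_entropy(alpha, mu):
    """Relative-entropy vote term, using the convention 0 * log(0) = 0."""
    return sum(a * math.log(a / m) for a, m in zip(alpha, mu) if a > 0)


def c_ising(alpha, S):
    """Ising consistency term: -alpha^T S alpha."""
    n = len(alpha)
    return -sum(alpha[i] * S[i][j] * alpha[j] for i in range(n) for j in range(n))


def huber(d):
    """Huber function: quadratic up to |d| = 1, linear beyond."""
    d = abs(d)
    return d * d / 2.0 if d <= 1.0 else d - 0.5


def c_huber(alpha, S):
    """Huber consistency term: sum_ij S_ij * huber(alpha_i - alpha_j)."""
    n = len(alpha)
    return sum(S[i][j] * huber(alpha[i] - alpha[j]) for i in range(n) for j in range(n))


def objective(alpha, mu, S, energy=e_linear, consistency=c_ising, lam=0.5, eps=1.0):
    """Q = E + lam * C + eps * R, with R = alpha^T alpha."""
    reg = sum(a * a for a in alpha)
    return energy(alpha, mu) + lam * consistency(alpha, S) + eps * reg


# Toy problem: labels 0 and 1 co-occur often, label 2 is unrelated to both.
mu = [0.6, 0.5, 0.55]
S = [[0.0, 0.9, 0.0],
     [0.9, 0.0, 0.0],
     [0.0, 0.0, 0.0]]

q_pair = objective([1.0, 1.0, 0.0], mu, S)    # the two consistent labels
q_mixed = objective([1.0, 0.0, 1.0], mu, S)   # two unrelated labels
```

In this toy setting the consistent pair gets the lower (better) objective value even though label 2 has a higher vote than label 1, which is the behaviour the consistency term is meant to encourage.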

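18 Pseudocode. Until the approximation accuracy is satisfied:
1. Apply gradient descent on the objective: $v_t = \alpha_t - \eta_t \nabla Q(\alpha_t)$
2. Project onto the feasible set by solving: $\alpha_{t+1} = \arg\min_{\alpha} \frac{1}{2} \|\alpha - v_t\|^2$ s.t. $0 \le \alpha \le 1$, $\|\alpha\|_1 \le N$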
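The loop above can be sketched as projected gradient descent in plain Python. Several things here are assumptions rather than details from the talk: the linear and Ising terms with a squared-norm regularizer are used as the objective, the step size is constant, a fixed iteration count replaces the accuracy test, and the projection is computed by bisection on a shift `theta`, which is one standard way to project onto a box intersected with a sum constraint. The names `project` and `laconic_inference` and the default parameters are hypothetical.

```python
def clip01(x):
    """Clip a scalar to the interval [0, 1]."""
    return min(1.0, max(0.0, x))


def project(v, n_max, iters=60):
    """Euclidean projection of v onto {0 <= alpha <= 1, sum(alpha) <= n_max}."""
    alpha = [clip01(x) for x in v]
    if sum(alpha) <= n_max:
        return alpha
    # Otherwise the sum constraint is active: find theta > 0 such that
    # sum(clip01(v - theta)) is (approximately) n_max, by bisection.
    lo, hi = 0.0, max(v)   # at theta = max(v) every entry clips to 0
    for _ in range(iters):
        theta = (lo + hi) / 2.0
        if sum(clip01(x - theta) for x in v) > n_max:
            lo = theta
        else:
            hi = theta
    return [clip01(x - hi) for x in v]


def laconic_inference(mu, S, n_max, lam=0.5, eps=1.0, step=0.1, iters=200):
    """Minimize -mu.a - lam * a.S.a + eps * a.a over the feasible set."""
    n = len(mu)
    alpha = project(list(mu), n_max)
    for _ in range(iters):
        grad = [
            -mu[i]
            - lam * sum((S[i][j] + S[j][i]) * alpha[j] for j in range(n))
            + 2.0 * eps * alpha[i]
            for i in range(n)
        ]
        v = [a - step * g for a, g in zip(alpha, grad)]   # gradient step
        alpha = project(v, n_max)                         # projection step
    return alpha


# Toy problem: labels 0 and 1 co-occur often, label 2 is unrelated to both.
mu = [0.6, 0.5, 0.55]
S = [[0.0, 0.9, 0.0],
     [0.9, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
alpha = laconic_inference(mu, S, n_max=2.0)
```

With these toy values the objective happens to be convex, so the iteration settles on a single solution in which label 1 ends up above label 2 despite its lower vote. With a larger `lam` the Ising term can make the problem non-convex, in which case this kind of loop is only expected to reach a local solution.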

19 When To Use Laconic. Hypothesis: a label consistency model may help if the noise level of the patch classifier's predictions is higher than that of the co-occurrence model.

20 Initial Experiments - Dataset.
ImageNet detection subtask: 3623 classes with bounding boxes; about 1M images in the test set.
Sun 12: multiple labels per image (3765 labels in total); about 1000 images in the test set; also access to a training-set co-occurrence matrix.

21 Comparing Laconic Approaches (out of 987 test images from the Sun 12 dataset)

| Statistics | Linear - Ising | Entropy - Ising | Linear - Ising - Projection | Entropy - Ising - Projection | Linear - Huber - Projection |
| Prec@1 | 86.2% | 86.3% | 86.4% | 85.8% | 85.8% |
| Prec@5 | 53.1% | 52.8% | 52.8% | 52.6% | 50.1% |
| Average Precision | 76.3% | 76.2% | 76.2% | 75.9% | 74.9% |

22 Results for Sun 12 (out of 987 test images from the Sun 12 dataset)

| Statistics | Baseline | Laconic-web | Laconic-train | Laconic-Mixture |
| Prec@1 | 85.7% | 86.2% | 85.5% | 86.1% |
| Prec@5 | 50.6% | 53.1% | 50.8% | 53.7% |
| Average Precision | 75.0% | 76.3% | 75.1% | 76.6% |

23 Results for ImageNet-3623 classes (out of … test images from the ImageNet detection task)

| Statistics | Baseline | Laconic | Relative Improvement |
| | 24.8% | 34.3% | 38.3% |
| | 44.4% | 52.4% | 18.0% |
| | 13.7% | 19.6% | 43.1% |
| | 25.8% | 28.4% | 10.1% |
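Reading the twelve values on this slide as four rows of (baseline, Laconic, relative improvement), the third value in each row is consistent with the usual definition (Laconic - baseline) / baseline. The short check below is only a sanity check of that reading; the helper name `relative_improvement` is made up for the example.

```python
def relative_improvement(baseline, laconic):
    """Relative improvement of laconic over baseline, in percent."""
    return 100.0 * (laconic - baseline) / baseline


# (baseline, Laconic) pairs as they appear on the slide, in percent.
rows = [(24.8, 34.3), (44.4, 52.4), (13.7, 19.6), (25.8, 28.4)]
improvements = [round(relative_improvement(b, l), 1) for b, l in rows]
# improvements matches the slide's third column: 38.3, 18.0, 43.1, 10.1
```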

24 Laconic vs. Baseline [figure]

25 Examples (Sun 12) [figure; labels shown: Cabinet, Worktop, Floor, Cabinet, Trees, Trees, Door, Floor, Door, Door, Trees, Window, Person, Floor]. Legend: black = ground truth, green = Laconic, blue = baseline.

26 Examples (ImageNet) [figure]. Legend: red = ground truth, green = baseline, blue = Laconic.

27 Conclusion. Use an external language model to regularize image categorization efficiently. We need a good source of co-occurrences. Experiments on a subset of ImageNet and on Sun 12 show promising results. More structure can be added to the model, for example by gathering sentences like "the sky is above the mountain" or "the monitor is on the table".
