MIL-UT at ILSVRC2014
IIT Guwahati (undergrad) -> Virginia Tech (intern)
Senthil Purushwalkam, Yuichiro Tsuchiya, Atsushi Kanehira, Asako Kanezaki and *Tatsuya Harada
The University of Tokyo
Pipeline of CLS-LOC task
1-1 Scoring each bounding box by R-CNN: Input image -> Extract region proposals -> Extract CNN features (fc7) -> Score regions by multiclass object detection with hard negative classes (trained with hard negative mining)
1-2 Scoring the whole image by FV as contextual scores: Whole image -> Extract FV with spatial information -> Score whole image
Late fusion of the region scores and the whole-image scores -> Score
Region Proposals and Feature Extraction
(Pipeline stage 1-1: extract region proposals and CNN features.)
R-CNN: R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.
Region proposals: Selective Search. J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders. Selective Search for Object Recognition. IJCV, 2013.
CNN features:
  - Single CNN model (5 conv layers, 2 fully connected layers)
  - Pre-computed ILSVRC13 model: http://www.cs.berkeley.edu/~rbg/r-cnn-release1-data-ilsvrc2013-caffe-proto-v0.tgz
  - No fine-tuning
  - 4096-dim fc7 features
Multiclass Object Detection
(Pipeline stage 1-1: scoring regions.)
Hard negative classes
  - Idea: create negative classes and train on 2K classes
  - A. Kanezaki, S. Inaba, Y. Ushiku, Y. Yamashita, H. Muraoka, Y. Kuniyoshi and T. Harada. Hard Negative Classes for Multiple Object Detection. ICRA, 2014.
  - Minimize detection errors as well as classification errors
  - Algorithm with hard negative mining
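Multiclass object detection (training with negative classes)
We use Passive-Aggressive (PA) [Crammer et al., 2006] to learn multi-class linear classifiers:
\[ W_{t+1} = \arg\min_{W} \frac{1}{2}\,\|W - W_t\|^2 + C\zeta \quad \text{s.t.} \quad l(x_t, y_t; W) \le \zeta,\ \ \zeta \ge 0 \]
Here $l(x_t, y_t; W)$ is the error on the training sample $(x_t, y_t)$.
Scores:
\[ W x_t = \begin{pmatrix} w_1^\top \\ \vdots \\ w_K^\top \end{pmatrix} x_t = \begin{pmatrix} \text{Score of class } 1 \\ \vdots \\ \text{Score of class } K \end{pmatrix} \]
r: positive class
s: negative class with the highest score
Update:
\[ w_r^{(t+1)} = w_r^{(t)} + \tau_t x_t, \qquad w_s^{(t+1)} = w_s^{(t)} - \tau_t x_t \]
\[ \text{where} \quad \tau_t = \min\left\{ C,\ \frac{1 - \left( w_r^{(t)\top} x_t - w_s^{(t)\top} x_t \right)}{2\,\|x_t\|^2} \right\} \]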
Multiclass object detection (training with negative classes)
Core idea: hard negative classes. Each class k gets its own negative class with weight vector $\bar{w}_k$:
\[ \begin{pmatrix} w_1^\top \\ \vdots \\ w_K^\top \\ \bar{w}_1^\top \\ \vdots \\ \bar{w}_K^\top \end{pmatrix} x_t = \begin{pmatrix} \text{Score of class } 1 \\ \vdots \\ \text{Score of class } K \\ \text{Score of negative class } 1 \\ \vdots \\ \text{Score of negative class } K \end{pmatrix} \]
Error: $l(x_t, y_t; W)$
Cf. a single background class $w_{bg}$ does not work.
$w_r$ and $w_s$ are updated by the PA rule with step size $\tau_t$.
Multiclass object detection (training with negative classes)
Ex.) If a training sample $x_t$ is a positive sample of class 2 (classification error):
  - r = class 2
  - s = negative class with the highest score; candidates of s: class 1, 3, ..., or K, or negative class 2
$w_r$ and $w_s$ are then updated by the PA rule with step size $\tau_t$.
Multiclass object detection (training with negative classes)
Ex.) If a training sample $x_t$ is a negative sample of class 2 (detection error):
  - s = class 2
  - r = negative class 2
$w_r$ and $w_s$ are then updated by the PA rule with step size $\tau_t$.
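The two examples above can be sketched as a small selection routine. The index layout (positive classes at 0..K-1, negative class k at K+k), the 0-based class indices, and the name `select_r_and_s` are hypothetical choices for this illustration.

```python
def select_r_and_s(scores, k, is_positive):
    """Pick the classes r and s for a training sample of class k.

    scores: 2K scores; entries 0..K-1 are the positive classes and
            entries K..2K-1 are their negative classes (assumed layout).
    k: 0-based index of the class the sample belongs to.
    is_positive: True for a positive sample, False for a negative sample.
    """
    K = len(scores) // 2
    if is_positive:
        # Classification error: r is class k; s is the best-scoring
        # candidate among the other positive classes and negative class k.
        r = k
        candidates = [j for j in range(K) if j != k] + [K + k]
        s = max(candidates, key=lambda j: scores[j])
    else:
        # Detection error: r is negative class k; s is class k itself.
        r = K + k
        s = k
    return r, s


# K = 3; index 1 plays the role of "class 2" on the slides.
scores = [0.2, 0.9, 0.1, 0.0, 0.95, 0.3]
pos_pair = select_r_and_s(scores, k=1, is_positive=True)
neg_pair = select_r_and_s(scores, k=1, is_positive=False)
```

Note that the negative classes of other classes are never candidates for s in the positive case, which follows the candidate list given in the first example.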
Features for Contextual Scores
(Pipeline stage 1-2: extract FV with spatial information from the whole image.)
Improved Fisher Vector (IFV): F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. ECCV, 2010.
INRIA's Fisher vector implementation: http://lear.inrialpes.fr/src/inria_fisher/
L2 normalization, power normalization, spatial pyramid
Parameters of IFV for all local features in our system:
  - Dimension reduction of local feature (D): 64 dim
  - Number of components in GMM (K): 256
  - 5 scales of local patches
  - Spatial pyramid (P): 1x1 + 2x2 + 3x1 = 8
  - Dimension of IFV: 2PKD = 262,144 dim
Local descriptors: SIFT
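As a quick check of the dimension quoted above, the arithmetic can be written out as follows. The variable names are illustrative; the factor 2 corresponds to the gradients with respect to the GMM means and variances in the improved Fisher vector.

```python
D = 64    # local feature dimension after reduction
K = 256   # number of GMM components
# Spatial pyramid cells: 1x1 + 2x2 + 3x1
P = 1 * 1 + 2 * 2 + 3 * 1
# Two gradient blocks (means and variances) per component and cell
ifv_dim = 2 * P * K * D
print(P, ifv_dim)
```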
Classifiers for Contextual Scores
(Pipeline stage 1-2: scoring the whole image by FV as contextual scores.)
Online Learning for Large-Scale Visual Recognition
Three guidelines: Y. Ushiku, M. Hidaka, T. Harada. Three Guidelines of Online Learning for Large-Scale Visual Recognition. CVPR, 2014.
1. Perceptron can compete against the latest methods, provided that the second guideline is observed.
2. Averaging is necessary for any algorithm. First-order algorithms without averaging cannot compete against second-order algorithms. When averaging is used, the accuracies of all algorithms become very close to each other.
3. Investigate multiclass learning first. Both one-versus-the-rest learning and multiclass learning achieve similar accuracy. However, one-versus-the-rest takes much longer CPU time to converge than multiclass does.
(Figure: averaging of the weight vectors $\mu_1, \mu_2, \dots, \mu_T$; the competing class is the $\arg\max$ of the score over $y \in Y \setminus y_i$.)
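To illustrate the first two guidelines, here is a minimal sketch of a multiclass perceptron that returns the average of its weight vectors over all steps. The slides do not spell out the averaging formula, so the uniform average over every step, the function names, and the toy data are assumptions for illustration only.

```python
def averaged_perceptron(samples, num_classes, dim, epochs=5):
    """Train a multiclass perceptron and return the averaged weights.

    samples: list of (x, y) pairs, x a list of floats, y a class index.
    The returned weights are the uniform average of the weights after
    every step (an assumed form of averaging).
    """
    W = [[0.0] * dim for _ in range(num_classes)]
    W_sum = [[0.0] * dim for _ in range(num_classes)]
    steps = 0
    for _ in range(epochs):
        for x, y in samples:
            scores = [sum(w_d * x_d for w_d, x_d in zip(w, x)) for w in W]
            pred = max(range(num_classes), key=lambda k: scores[k])
            if pred != y:
                # Perceptron update: promote the true class, demote the prediction
                W[y] = [w_d + x_d for w_d, x_d in zip(W[y], x)]
                W[pred] = [w_d - x_d for w_d, x_d in zip(W[pred], x)]
            # Accumulate the current weights for averaging
            for k in range(num_classes):
                for d in range(dim):
                    W_sum[k][d] += W[k][d]
            steps += 1
    return [[v / steps for v in row] for row in W_sum]


def predict(W, x):
    """Return the index of the highest-scoring class."""
    scores = [sum(w_d * x_d for w_d, x_d in zip(w, x)) for w in W]
    return max(range(len(W)), key=lambda k: scores[k])


samples = [([1.0, -1.0], 0), ([-1.0, 1.0], 1)]
W_avg = averaged_perceptron(samples, num_classes=2, dim=2, epochs=5)
```

Averaging only changes which weights are returned at the end; the per-sample updates are the same as in the plain perceptron.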
Late Fusion
1-1 Scoring each bounding box by R-CNN: Input image -> Extract region proposals -> Compute CNN features (fc7) -> Score regions by multiclass PA for each class (class 1, ..., class j, ..., class 1000), giving $S^{CNN}_{i,1}, \dots, S^{CNN}_{i,j}, \dots, S^{CNN}_{i,1000}$
1-2 Scoring the whole image by FV as contextual scores: Whole image -> Extract FV with spatial information -> Score by a linear classifier trained by PA for each class (class 1, ..., class j, ..., class 1000), giving $S^{FV}_{1}, \dots, S^{FV}_{j}, \dots, S^{FV}_{1000}$
2. Rescoring by combining the R-CNN feature and FV: for bounding box i and class j,
\[ S^{new}_{i,j} = S^{CNN}_{i,j}\, S^{FV}_{j} \]
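The rescoring step can be sketched as below. The operator between the two scores is not legible in the slide text, so this sketch assumes a plain product; the function names and the toy scores are also hypothetical.

```python
def late_fusion(s_cnn, s_fv):
    """Combine region scores with whole-image scores.

    s_cnn[i][j]: R-CNN score of bounding box i for class j.
    s_fv[j]: FV (contextual) score of the whole image for class j.
    ASSUMPTION: the two scores are multiplied; the source does not
    show the operator explicitly.
    """
    return [[s_ij * s_fv[j] for j, s_ij in enumerate(row)] for row in s_cnn]


def best_detection(s_new):
    """Return (box index, class index) of the highest fused score."""
    pairs = [(i, j) for i, row in enumerate(s_new) for j in range(len(row))]
    return max(pairs, key=lambda ij: s_new[ij[0]][ij[1]])


s_cnn = [[0.9, 0.2], [0.5, 0.8]]   # two boxes, two classes (toy values)
s_fv = [0.1, 0.9]                  # whole-image score per class
s_new = late_fusion(s_cnn, s_fv)
best = best_detection(s_new)
```

In this toy case the first box has the highest region score for class 0, but the low contextual score for that class moves the best detection to the second box and class 1.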
Results

Validation dataset
Method                                                               Localization error   Classification error
R-CNN feature + one-vs-all SVMs                                      0.631743             0.460080
R-CNN feature + multi-class PA                                       0.446121             0.285720
R-CNN feature + multi-class PA using hard negative classes           0.387516             0.227200
R-CNN feature + multi-class PA using hard negative classes, and FV   0.341743             0.18768

Test dataset
Team name        Localization error   Classification error
VGG              0.253231             0.07405
GoogLeNet        0.264414             0.14828
SYSU_Vision      0.31899              0.14446
MIL (our team)   0.337414             0.20734
Conclusion
Our pipeline:
  - R-CNN based region proposals and features, with multi-class object detectors that create a hard negative class for each positive class
  - Global features (FVs) with multi-class online learning
  - Late fusion of region score and global score
Combining R-CNN with the contextual information improves the localization performance.
Multi-class object detector trained with hard negative classes outperforms one-vs.-the-rest SVMs.