MIL-UT at ILSVRC2014


MIL-UT at ILSVRC2014
Senthil Purushwalkam (IIT Guwahati undergrad -> Virginia Tech intern), Yuichiro Tsuchiya, Atsushi Kanehira, Asako Kanezaki and *Tatsuya Harada
The University of Tokyo

Pipeline of the CLS-LOC task
1-1: Score each region proposal with R-CNN: input image -> extract region proposals -> extract CNN features (fc7) -> score regions with a multiclass object detector using hard negative classes / hard negative mining.
1-2: Score the whole image with a Fisher Vector (FV) as a contextual score: whole image -> extract FV with spatial information -> score the whole image.
Late fusion of the two scores gives the final score.

Region Proposals and Feature Extraction (step 1-1: scoring each region proposal with R-CNN)
R-CNN: R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.
Region proposals: Selective Search. J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders. Selective Search for Object Recognition. IJCV, 2013.
CNN features: a single CNN model (5 convolutional layers, 2 fully connected layers), using the pre-computed ILSVRC13 model (http://www.cs.berkeley.edu/~rbg/r-cnn-release1-data-ilsvrc2013-caffe-proto-v0.tgz) with no fine-tuning; 4096-dim fc7 features. A feature-extraction sketch follows below.
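To make this step concrete, here is a minimal sketch of producing Selective Search proposals and 4096-dim fc7 descriptors for them. It is not the team's code: the talk used the pre-computed Caffe ILSVRC13 R-CNN model, whereas this sketch substitutes OpenCV's Selective Search and a torchvision AlexNet as a stand-in fc7 extractor, and all variable names are ours.

    import cv2                      # needs opencv-contrib-python for ximgproc
    import torch
    import torchvision
    from torchvision import transforms

    # Selective Search region proposals (Uijlings et al., IJCV 2013)
    img_bgr = cv2.imread("image.jpg")
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img_bgr)
    ss.switchToSelectiveSearchFast()
    rects = ss.process()            # array of (x, y, w, h) boxes

    # Stand-in fc7 extractor: AlexNet up to its second fully connected layer (4096-dim output)
    alexnet = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()
    fc7 = torch.nn.Sequential(alexnet.features, alexnet.avgpool, torch.nn.Flatten(),
                              alexnet.classifier[:6])
    preprocess = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    features = []
    with torch.no_grad():
        for (x, y, w, h) in rects[:2000]:                          # cap the number of proposals
            crop = cv2.cvtColor(img_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
            features.append(fc7(preprocess(crop).unsqueeze(0)).squeeze(0))   # 4096-dim fc7 feature
    features = torch.stack(features)                               # (n_regions, 4096)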

Multiclass Object Detection (step 1-1 continued)
Hard negative classes. Idea: create one negative class per positive class and train on 2K classes.
A. Kanezaki, S. Inaba, Y. Ushiku, Y. Yamashita, H. Muraoka, Y. Kuniyoshi and T. Harada. Hard Negative Classes for Multiple Object Detection. ICRA, 2014.
Combined with hard negative mining, this minimizes detection errors as well as classification errors.

Multiclass object detection (training with negative classes)
We use the Passive-Aggressive algorithm (PA) [Crammer et al., 2006] to learn multi-class linear classifiers:
W_{t+1} = argmin_W (1/2) ||W - W_t||^2 + C ζ   s.t.  l(x_t, y_t; W) ≤ ζ,  ζ ≥ 0.
The product W x_t stacks the per-class scores w_1^T x_t, ..., w_K^T x_t of classes 1, ..., K. With r the positive class and s the negative class with the highest score, the update is
w_r^{(t+1)} = w_r^{(t)} + τ_t x_t,   w_s^{(t+1)} = w_s^{(t)} - τ_t x_t,
where τ_t = min( C, (1 - (w_r^{(t)T} x_t - w_s^{(t)T} x_t)) / (2 ||x_t||^2) ).

Multiclass object detection (training with negative classes): core idea, hard negative classes
The score vector W x_t now holds 2K entries: the scores of positive classes 1, ..., K and of negative classes 1, ..., K, with one negative class per positive class. (Cf. a single shared background class w_bg does not work.)
The update rule is unchanged: w_r^{(t+1)} = w_r^{(t)} + τ_t x_t, w_s^{(t+1)} = w_s^{(t)} - τ_t x_t, with τ_t = min( C, (1 - (w_r^{(t)T} x_t - w_s^{(t)T} x_t)) / (2 ||x_t||^2) ).

Multiclass object detection (training with negative classes): example, positive sample
If a training sample x_t is a positive sample of class 2, then r = class 2 and the candidates for s are classes 1, 3, ..., K and negative class 2; s is the candidate with the highest score, and a violated margin is a classification error. The same update is then applied.

Multiclass object detection (training with negative classes): example, negative sample
If a training sample x_t is a negative sample of class 2 (a hard negative mined for the class-2 detector), then r = negative class 2 and s = class 2, and a violated margin is a detection error. The same update is applied. A code sketch of the full update with negative classes follows below.
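Reading the three slides above as pseudocode, a minimal NumPy sketch of the multiclass PA update with one hard negative class per positive class might look as follows. This is our own reconstruction from the slides (class and variable names are ours), not the authors' implementation.

    import numpy as np

    class MulticlassPAHardNeg:
        """Multiclass Passive-Aggressive (PA-I) with one negative class per positive class (2K rows)."""

        def __init__(self, n_classes, dim, C=1.0):
            self.K = n_classes
            self.C = C
            self.W = np.zeros((2 * n_classes, dim))   # rows 0..K-1: positive classes, rows K..2K-1: negative classes

        def update(self, x, cls, is_positive):
            scores = self.W @ x
            if is_positive:
                # r is the positive class; s is the highest-scoring competitor among the
                # other positive classes and the negative class paired with `cls`.
                r = cls
                candidates = [k for k in range(self.K) if k != cls] + [self.K + cls]
                s = max(candidates, key=lambda k: scores[k])       # violated margin = classification error
            else:
                # hard negative mined for class `cls`: its negative class must beat the positive one
                r = self.K + cls
                s = cls                                            # violated margin = detection error
            loss = max(0.0, 1.0 - (scores[r] - scores[s]))         # hinge loss l(x_t, y_t; W)
            if loss > 0.0:
                tau = min(self.C, loss / (2.0 * float(x @ x)))     # tau_t from the slide
                self.W[r] += tau * x
                self.W[s] -= tau * x

        def detection_scores(self, x):
            return self.W[:self.K] @ x                             # only the positive rows score regions

At test time only the K positive rows score region proposals; the negative rows exist purely to absorb hard negatives during training.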

Features for Contextual Scores (step 1-2: scoring the whole image with a FV as a contextual score)
Improved Fisher Vector (IFV): F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. ECCV, 2010. We use INRIA's Fisher vector implementation (http://lear.inrialpes.fr/src/inria_fisher/) with L2 normalization, power normalization, and a spatial pyramid.
Local descriptors: SIFT, extracted at 5 scales of local patches.
IFV parameters: dimension of the reduced local feature D = 64, number of GMM components K = 256, spatial pyramid P = 1x1 + 2x2 + 3x1 = 8 cells; the IFV dimension is 2PKD = 262,144.
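For reference, a compact sketch of the Improved Fisher Vector encoding with the parameters quoted above (D = 64, K = 256, power and L2 normalization). It follows the Perronnin et al. formulation rather than the INRIA code used in the talk, and the function names are ours.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_gmm(sampled_descriptors, K=256):
        """Fit the diagonal-covariance GMM vocabulary on PCA-reduced local descriptors of shape (N, 64)."""
        return GaussianMixture(n_components=K, covariance_type="diag").fit(sampled_descriptors)

    def improved_fisher_vector(descs, gmm):
        """Improved FV for one image (or one spatial-pyramid cell): 2*K*D dimensions."""
        N, D = descs.shape
        w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_     # (K,), (K, D), (K, D) diagonals
        sigma = np.sqrt(var)
        gamma = gmm.predict_proba(descs)                            # (N, K) soft assignments
        g_mu = np.empty_like(mu)
        g_sigma = np.empty_like(mu)
        for k in range(len(w)):
            d = (descs - mu[k]) / sigma[k]                          # normalized differences, (N, D)
            g_mu[k] = gamma[:, k] @ d / (N * np.sqrt(w[k]))         # gradient w.r.t. the mean
            g_sigma[k] = gamma[:, k] @ (d ** 2 - 1.0) / (N * np.sqrt(2.0 * w[k]))  # w.r.t. the std. dev.
        fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
        fv = np.sign(fv) * np.sqrt(np.abs(fv))                      # power normalization
        return fv / (np.linalg.norm(fv) + 1e-12)                    # L2 normalization

Concatenating the vectors of the P = 8 spatial-pyramid cells gives 2PKD = 2 x 8 x 256 x 64 = 262,144 dimensions, matching the slide.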

Classifiers for Contextual Scores (step 1-2 continued)
The whole-image FV is scored with linear classifiers trained online with multiclass PA, one score per class (see the following slides).

Online Learning for Large-Scale Visual Recognition: three guidelines
Y. Ushiku, M. Hidaka, T. Harada. Three Guidelines of Online Learning for Large-Scale Visual Recognition. CVPR, 2014.
1. Perceptron can compete with the latest methods, provided that the second guideline is observed.
2. Averaging is necessary for any algorithm. First-order algorithms without averaging cannot compete with second-order algorithms; when averaging is used, the accuracies of all algorithms become very close to each other.
3. Investigate multiclass learning first. One-versus-the-rest learning and multiclass learning achieve similar accuracy, but one-versus-the-rest takes much longer CPU time to converge than multiclass does.
[Figure: illustration of weight-vector averaging]
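To illustrate guideline 2, here is a toy sketch of weight averaging around a plain multiclass perceptron (our own example; any first-order online learner can be substituted for the perceptron update).

    import numpy as np

    def averaged_multiclass_perceptron(X, y, n_classes, epochs=5):
        """Online multiclass perceptron; returns the *averaged* weights (guideline 2)."""
        W = np.zeros((n_classes, X.shape[1]))
        W_sum = np.zeros_like(W)
        n_steps = 0
        for _ in range(epochs):
            for x, label in zip(X, y):
                pred = int(np.argmax(W @ x))
                if pred != label:          # perceptron: update only on mistakes
                    W[label] += x
                    W[pred] -= x
                W_sum += W                 # accumulate after every step, mistake or not
                n_steps += 1
        return W_sum / n_steps             # use these averaged weights at test time

    # prediction with the averaged weights:
    # y_hat = np.argmax(W_avg @ x_test)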

Late Fusion
Step 1-1 gives, for each region proposal i and each of the 1000 classes j, a score S^CNN_{i,j} from the multiclass PA classifier on the R-CNN fc7 features. Step 1-2 gives, for each class j, a whole-image contextual score S^FV_j from a linear classifier trained with PA on the FV. The regions are rescored by combining the two: for bounding box i and class j,
S^new_{i,j} = S^CNN_{i,j} × S^FV_j.
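The rescoring itself is a one-liner; a small NumPy sketch with made-up score arrays (array names are ours) is shown here.

    import numpy as np

    rng = np.random.default_rng(0)
    n_regions, n_classes = 2000, 1000
    S_cnn = rng.random((n_regions, n_classes))       # region scores from multiclass PA on fc7 features
    S_fv = rng.random(n_classes)                     # whole-image contextual scores from the FV classifier

    S_new = S_cnn * S_fv[None, :]                    # S_new[i, j] = S_cnn[i, j] * S_fv[j]
    top5 = np.argsort(S_new.max(axis=0))[::-1][:5]   # top-5 classes after late fusion
    best_box = S_new[:, top5[0]].argmax()            # region used to localize the top class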

Results

Validation dataset:
Method                                                                Localization error   Classification error
R-CNN feature + one-vs-all SVMs                                       0.631743             0.460080
R-CNN feature + multi-class PA                                        0.446121             0.285720
R-CNN feature + multi-class PA using hard negative classes            0.387516             0.227200
R-CNN feature + multi-class PA using hard negative classes, and FV    0.341743             0.18768

Test dataset:
Team name         Localization error   Classification error
VGG               0.253231             0.07405
GoogLeNet         0.264414             0.14828
SYSU_Vision       0.31899              0.14446
MIL (our team)    0.337414             0.20734

Conclusion
Our pipeline:
- R-CNN based region proposals and features, scored with multi-class object detectors that create a hard negative class for each positive class (step 1-1).
- Global features (FVs) with spatial information, scored with multi-class online learning (step 1-2).
- Late fusion of the region scores and the global score.
Combining R-CNN with the contextual information improves localization performance, and a multi-class object detector trained with hard negative classes outperforms one-vs.-the-rest SVMs.