Compressed Fisher vectors for LSVR

XRCE@ILSVRC2011
Compressed Fisher vectors for LSVR
Florent Perronnin and Jorge Sánchez*
Xerox Research Centre Europe (XRCE)
*Now with CIII, Cordoba University, Argentina

Our system in a nutshell

High-dimensional image signatures: Fisher Vectors (FV)
  Perronnin, Sánchez and Mensink, Improving the Fisher kernel for large-scale image classification, ECCV'10.
+ Compression: Product Quantization (PQ)
  Sánchez and Perronnin, High-dimensional signature compression for large-scale image classification, CVPR'11.
+ Simple machine learning: one-vs-all linear SVMs
  Linear classifiers learned in the primal using Stochastic Gradient Descent (SGD)

The need for fine-grained information

As the number of classes increases, we need finer-grained information.
The BOV answer to the problem: increase the visual dictionary size.
  See e.g. Li, Crandall and Huttenlocher, Landmark classification in large-scale image collections, ICCV'09.
  → high computational cost.
How can we increase the image signature dimensionality without increasing the visual dictionary size?
Embedding view of BOV: a single local descriptor is mapped to a sparse assignment vector whose only non-zero entry is the index of its closest visual word.
A possible solution: include higher-order statistics in the embedding → the Fisher vector.

Fisher vector: patch embedding

Notations:
  X = {x_1, ..., x_T}: set of D-dim local descriptors (e.g. SIFT)
  u_λ: GMM with parameters λ = {w_k, μ_k, Σ_k, k = 1, ..., N}
Fisher vector: G_λ^X = L_λ ∇_λ log u_λ(X), with L_λ the Cholesky decomposition of the inverse Fisher information matrix of u_λ.
A linear classifier on the FV induces, in the descriptor space:
  BOV case: a piecewise constant decision function
  FV case: a piecewise quadratic decision function
→ the FV leads to more complex decision functions than the BOV.
Perronnin and Dance, Fisher kernels on visual categories for image categorization, CVPR'07.
See also: Csurka and Perronnin, An efficient approach to semantic segmentation, IJCV'10.
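To make the embedding concrete, here is a minimal NumPy sketch of the FV gradients with respect to the GMM means and variances, using the closed-form diagonal Fisher-information normalization of the ECCV'10 paper; function and variable names are ours, and the weight gradients are omitted as in the paper.

```python
import numpy as np

def fisher_vector(X, w, mu, sigma2):
    # X: (T, D) local descriptors; w: (N,) GMM weights;
    # mu: (N, D) means; sigma2: (N, D) diagonal variances.
    T = X.shape[0]
    diff = X[:, None, :] - mu[None, :, :]                      # (T, N, D)
    # log w_k + log N(x | mu_k, sigma2_k), computed in the log domain
    log_p = (np.log(w)[None, :]
             - 0.5 * np.sum(np.log(2.0 * np.pi * sigma2), axis=1)[None, :]
             - 0.5 * np.sum(diff ** 2 / sigma2[None, :, :], axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)                  # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                  # posteriors, (T, N)
    u = diff / np.sqrt(sigma2)[None, :, :]                     # whitened residuals
    # Mean and variance gradients with closed-form normalization
    g_mu = np.einsum('tn,tnd->nd', gamma, u) / (T * np.sqrt(w)[:, None])
    g_sig = np.einsum('tn,tnd->nd', gamma, u ** 2 - 1.0) / (T * np.sqrt(2.0 * w)[:, None])
    return np.concatenate([g_mu.ravel(), g_sig.ravel()])       # 2*N*D dimensions
```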

Fisher vector: improvements

FV normalization is crucial to get state-of-the-art results:
  Perronnin, Sánchez and Mensink, Improving the Fisher kernel for large-scale image classification, ECCV'10.
  Power normalization: variance stabilization of a compound Poisson distribution.
    See also: Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid, Aggregating local descriptors into compact codes, upcoming TPAMI.
  L2 normalization: invariance to the amount of background.
Spatial pyramids: final representation = concatenation of per-region Fisher vectors.
With D = feature dim, N = # Gaussians and R = # regions: FV dim = 2DNR, e.g. 2 × 64 × 1,024 × 4 ≈ 0.5M dimensions.
→ large (informative) signatures even for small vocabularies.
See also: Chatfield, Lempitsky, Vedaldi, Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, BMVC'11.
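The two normalization steps are a one-liner each; a minimal sketch (illustrative, not the authors' code):

```python
import numpy as np

def normalize_fv(fv, alpha=0.5):
    # Power normalization: signed power, i.e. "square-rooting" for alpha = 0.5
    fv = np.sign(fv) * np.abs(fv) ** alpha
    # L2 normalization: makes the signature invariant to the amount of background
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```

With spatial pyramids, each per-region FV is normalized this way before the concatenation.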

The need for HD signatures

ILSVRC 2010: use a subset of the classes and vary the number of Gaussians N.
→ HD signatures make a difference for a large number of classes.
See also: Tsuda, Kawanabe and Müller, Clustering with the Fisher score, NIPS'02.

The need for compression

The FV is high-dimensional (e.g. 0.5M dims) and only weakly sparse (about half of the dimensions are non-zero) → very high storage / memory / I/O cost.
Example using 4-byte floating-point storage:
  ILSVRC 2010 → 2.8 TB
  ImageNet → 23 TB per low-level descriptor type
We explored two forms of compression:
  dimensionality reduction (dense and sparse): reduces storage and learning cost
  coding with product quantization: reduces storage only
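The storage figures follow from simple arithmetic. A quick check, assuming roughly 1.4M ILSVRC 2010 images and 11M ImageNet images (the dataset sizes are our assumption):

```python
dims = 2 * 64 * 1024 * 4               # 2 x D x N x R = 524,288 ≈ 0.5M dimensions
bytes_per_image = dims * 4             # float32 -> ≈ 2.1 MB per image
print(1.4e6 * bytes_per_image / 1e12)  # ILSVRC 2010, ~1.4M images: ≈ 2.9 TB
print(11e6 * bytes_per_image / 1e12)   # ImageNet, ~11M images: ≈ 23 TB
```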

Dimensionality reduction

Dense projections, e.g. PCA, Gaussian Random Projections, PLS:
  cost linear in the number of input and output dimensions
  → useful only if we can keep a very small number of output dimensions
  → huge accuracy drop at a reasonable computational cost.
Sparse projections: random aggregation with the Hash Kernel (HK).
  Shi, Petterson, Dror, Langford, Smola, Strehl, Vishwanathan, Hash kernels, AISTATS'09.
  Weinberger, Dasgupta, Langford, Smola and Attenberg, Feature hashing for large scale multitask learning, ICML'09.
Results on ILSVRC 2010: [results table not reproduced in the transcript].
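For intuition, here is a toy version of the sparse random-aggregation idea behind the hash kernel, using explicit random index/sign tables in place of a hash function (names are ours):

```python
import numpy as np

def hash_project(fv, out_dim, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, out_dim, size=fv.size)   # target bucket per input dim
    sign = rng.choice([-1.0, 1.0], size=fv.size)   # random sign per input dim
    out = np.zeros(out_dim)
    np.add.at(out, idx, sign * fv)                 # aggregate collisions additively
    return out
```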

Coding with Product Quantization (PQ): principle

Split the vector into (small) sub-vectors of G dimensions each.
Training step: k-means clustering for each sub-vector position (2^K centroids) → b = K/G bits per dimension.
Compression step: encode each sub-vector by the index of its closest centroid → the FV is encoded as a vector of indices.
  Jégou, Douze and Schmid, Product quantization for nearest neighbor search, TPAMI'11.
The larger b and G, the higher the accuracy... and the higher the computational/storage cost, which is in O(2^{bG}).
→ fix b according to the storage budget and G according to the computational budget.
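A minimal PQ sketch along these lines, with scipy's k-means standing in for the training step (function names are ours):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def pq_train(X, G=8, K=8):
    # One codebook of 2**K centroids per G-dim sub-vector -> b = K/G bits/dim
    D = X.shape[1]
    assert D % G == 0
    return [kmeans2(X[:, i:i + G], 2 ** K, minit='points')[0]
            for i in range(0, D, G)]

def pq_encode(x, codebooks):
    # One centroid index per sub-vector (uint8 is enough for K <= 8)
    G = codebooks[0].shape[1]
    return np.array([np.argmin(((cb - x[i * G:(i + 1) * G]) ** 2).sum(axis=1))
                     for i, cb in enumerate(codebooks)], dtype=np.uint8)

def pq_decode(codes, codebooks):
    # Decompression is a pure table lookup: concatenate the selected centroids
    return np.concatenate([cb[c] for cb, c in zip(codebooks, codes)])
```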

Coding with PQ: learning

Learning is done in the space of FVs, using Stochastic Gradient Descent (SGD) to learn linear SVMs:
  1. one sample is decompressed (look-up table accesses);
  2. it is fed to the SGD learning algorithm;
  3. the decompressed sample is discarded.
→ only one uncompressed sample is alive at any time in RAM.
No compression is needed at test time.
For a reference SGD implementation, see Bottou's webpage: http://leon.bottou.org/projects/sgd
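The training loop then looks like the following sketch: a Pegasos-style hinge-loss SGD reusing pq_decode from the previous block. Hyper-parameters and names are illustrative, not the authors' implementation.

```python
import numpy as np

def train_svm_sgd(codes, labels, codebooks, epochs=5, lam=1e-5):
    D = sum(cb.shape[1] for cb in codebooks)
    w, t = np.zeros(D), 0
    for _ in range(epochs):
        for c, y in zip(codes, labels):      # labels y in {-1, +1}
            t += 1
            eta = 1.0 / (lam * t)            # decreasing step size
            x = pq_decode(c, codebooks)      # decompress ONE sample (lookups only)
            margin = y * w.dot(x)
            w *= (1.0 - eta * lam)           # shrink: regularizer gradient
            if margin < 1.0:                 # hinge loss is active
                w += eta * y * x
            # x goes out of scope here -> only one uncompressed FV in RAM
    return w
```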

Coding with PQ: results on ILSVRC 2010

If one has a large storage budget (large b), scalar quantization (G = 1) is sufficient.
G has a large impact for a small number of bits b.
PQ performs significantly better than dimensionality reduction at any compression rate.

Summary

Low-level feature extraction: ~10K patches per image
  SIFT: 128-dim, reduced to 64-dim with PCA
  color: 96-dim
FV extraction and compression: N = 1,024 Gaussians, R = 4 regions → 520K dims × 2
  compression: G = 8, b = 1 bit per dimension
One-vs-all SVM learning with SGD.
Late fusion of the SIFT and color systems.
For details, see: Sánchez and Perronnin, High-dimensional signature compression for large-scale image classification, CVPR'11.

Results on ILSVRC 2010

Challenge winners: a combination of 6 systems → 71.8%
  sparse coding and super-vector coding (very close to FVs)
  HOG and LBP features
  various codebooks and spatial pyramids
  Lin, Lv, Zhu, Yang, Cour, Yu, Cao and Huang, Large-scale image classification: fast feature extraction and SVM training, CVPR'11.
Our result: 74.3%, trained in one week on a single machine with 16 CPUs.
Can we get reasonable results with fewer dimensions? Yes!
  Concatenation of SIFT FV + color FV (N = 256, no spatial pyramid) → 65K dimensions
  + PQ compression + linear SVMs: 71.4%
  → the training set fits in 3.4 GB!

Can we handle larger datasets?

ImageNet10K: 10,184 classes (leaves and internal nodes), 9M images (½ for training, ½ for testing); accuracy measured as % of top-1 correct.
  SIFT + BOV + SP + fast IKSVM: 6.4%
    Deng, Berg, Li, Fei-Fei, What does classifying more than 10,000 image categories tell us?, ECCV'10.
  SIFT + FV (N = 256) + SP (R = 4), 130K-dim image signatures + PQ compression + linear SVM: 16.7%
For details, see: Sánchez and Perronnin, High-dimensional signature compression for large-scale image classification, CVPR'11.