RAGAV VENKATESAN VIJETHA GATUPALLI BAOXIN LI NEURAL DATASET GENERALITY

ALL ABOUT THE FEATURES: SIFT, HOG, DAISY, GABOR

CONVOLUTIONAL NEURAL NETWORKS: AlexNet, GoogleNet, VGG-19, ResNet

FEATURES COME FROM DATA: PCA, dictionaries, neural networks. We know/posit that these features are the best representations of the data for the dataset we are currently concerned with. What about off-the-shelf neural features?

OFF-THE-SHELF FEATURES Extract features from a large dataset such as ImageNet. Make the feed-forward network public. Download off-the-shelf networks. Extract features on the user dataset. Train a new classifier on top. Is there a guarantee that the downloaded features can capture the intricacies and idiosyncrasies of the user dataset?
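As a concrete illustration, here is a minimal sketch of this off-the-shelf pipeline using PyTorch/torchvision and scikit-learn; the backbone choice (ResNet-18) and the user_loader data loader are illustrative assumptions, not part of the original slides.

```python
# Sketch: extract off-the-shelf ImageNet features, then train a new classifier on top.
# Assumes torchvision >= 0.13 and scikit-learn; `user_loader` is a hypothetical
# DataLoader yielding (images, labels) from the user dataset, resized/normalized
# the way the ImageNet-trained backbone expects.
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the ImageNet classifier head, keep features
backbone.eval()                     # features only; no fine-tuning of the backbone

feats, labels = [], []
with torch.no_grad():
    for x, y in user_loader:        # the user's own dataset
        feats.append(backbone(x))
        labels.append(y)
feats = torch.cat(feats).numpy()
labels = torch.cat(labels).numpy()

# New classifier trained on the downloaded (frozen) features.
clf = LogisticRegression(max_iter=1000).fit(feats, labels)
```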

OFF-THE-SHELF FEATURES Is there a guarantee that the downloaded features can capture the intricacies and idiosyncrasies of the user dataset? Most often not! But we roll with it, because it works.

The ubiquity of downloaded CNNs. Unquestioned performance of networks trained on ImageNet. One network fits all. But does it?

ATOMIC STRUCTURES CNN filters take on particular shapes due to the entropy of the dataset. Some datasets have unique idiosyncrasies that show up as atomic structures. These may be edges and Gabor-like filters in the first layers, and so on.

[Diagram: a source S and datasets D1, D2, D3, D4, D5.]

A GENERALITY RANKING METRIC Generality is not a rankable concept. Due to the overlapping nature of feature expressions, representations aren't usually nestable or complete. Generality is only a relative concept. Can we use the neural training procedure and dataset performance to measure dataset generality? Yes. It is a very close corollary to network transferability and remembrance [1]. [1] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson, "How transferable are features in deep neural networks?", in Advances in Neural Information Processing Systems, 2014, pp. 3320-3328.

DETOUR: OBSTINATE LEARNING. An obstinate layer is a layer whose weights are not allowed to update during training; its gradients are simply ignored. An obstinate layer and all the layers that feed into it must all be frozen. Examples: downloading a network and training only the softmax layer; layer-wise pre-training; dropout (not exactly, but similar).
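A minimal sketch of obstinate layers in a PyTorch-style network; the toy architecture and the choice of which layers to freeze are illustrative assumptions, not the network used on these slides.

```python
# Sketch: make the early layers "obstinate" -- requires_grad is switched off,
# so their gradients are ignored and the optimizer never updates them.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),   # block 1 (obstinate)
    nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),  # block 2 (obstinate)
    nn.Flatten(),
    nn.Linear(32 * 4 * 4, 10),                         # softmax layer (trainable)
)

# Freeze the obstinate layers and everything that feeds into them.
for module in list(net)[:6]:                 # the two convolutional blocks
    for p in module.parameters():
        p.requires_grad = False

# Only the remaining (unfrozen) parameters are handed to the optimizer.
optimizer = torch.optim.SGD(
    [p for p in net.parameters() if p.requires_grad], lr=0.01
)
```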

EXPERIMENT SETUP 1. Consider two datasets D_i and D_j. 2. Initialize a network with random weights and train it with D_i. This network is called the base network and is represented by n(D_i, r). 3. Retrain n(D_i, r) on D_j to produce n_k(D_j | D_i). [Figure: copies of the base network retrained with different numbers of obstinate layers, n_1(D_j | D_i), n_2(D_j | D_i), ..., n_k(D_j | D_i), each classifying the digits 0 through 9.]

GENERALITY METRIC The performance of n(D_j, r) is Psi(D_j, r). The performance of n_k(D_j | D_i) is Psi_k(D_j | D_i). The dataset generality of D_i with respect to D_j at layer k is: g_k(D_i, D_j) = Psi_k(D_j | D_i) / Psi(D_j, r)

Psi_k(D_j | D_i) is the performance achieved by D_j using N - k layers' worth of prejudice (features) from D_i and k layers of novel knowledge from D_j. g_k(D_i, D_j) = Psi_k(D_j | D_i) / Psi(D_j, r)
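A sketch of how this metric could be computed, assuming (as the slides suggest) that k counts the layers allowed to update while the rest stay obstinate; train, evaluate, and copy_and_freeze are hypothetical helpers, not functions from the original work.

```python
# Sketch of the generality computation g_k(D_i, D_j) = Psi_k(D_j | D_i) / Psi(D_j, r).
# Assumed helpers (hypothetical): `train` fits a network on a dataset, `evaluate`
# returns test accuracy on that dataset, and `copy_and_freeze` copies the base
# weights and leaves only k layers free to update (the rest are obstinate).

def generality(D_i, D_j, k, make_net):
    # Base network n(D_i, r): random initialization, trained on D_i.
    base = train(make_net(), D_i)

    # Randomly initialized reference n(D_j, r): its performance is Psi(D_j, r).
    psi_r = evaluate(train(make_net(), D_j), D_j)

    # Retrain the base on D_j with only k layers unfrozen: n_k(D_j | D_i).
    retrained = train(copy_and_freeze(base, unfrozen_layers=k), D_j)
    psi_k = evaluate(retrained, D_j)

    # Values near 1 mean D_i's features are about as useful to D_j as training from scratch.
    return psi_k / psi_r
```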

PROPERTIES OF THIS GENERALITY METRIC If g_k(D_i, D_j) > g_k(D_i, D_l), then at k layers D_i provides more general features to D_j than to D_l. Conversely, when initialized by n(D_i, r), D_j has an advantage in learning over D_l. g_k(D_i, D_i) is approximately 1 for all k. g_k(D_i, D_j) for i != j may or may not be greater than 1.

DATASETS CONSIDERED

REFERENCES TO DATASETS 1. Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 473-480. (MNIST and all its rotations.) 2. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain, 2011, vol. 2011, p. 5. 3. T. E. de Campos, B. R. Babu, and M. Varma, "Character recognition in natural images," in Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, February 2009. 4. Alex Krizhevsky and Geoffrey Hinton, "Learning multiple layers of features from tiny images," 2009. 5. Li Fei-Fei, Rob Fergus, and Pietro Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 59-70, 2007.

SOME INTERESTING RESULTS

[Figure: generality curves for each dataset as base against all the other comparing datasets. Panels: retrained CALTECH101, char74-kannada, MNIST, mnist-bg-rand, SVHN, char74-english, and mnist-rotated-bg for different bases, plus Caltech 101 vs. colonoscopy. Each panel plots generality against the number of layers unfrozen; the compared bases include MNIST, MNIST with random background, MNIST with rotated background, NIST Special Dataset-19, Google Street View House Numbers, Char74k English, Char74k Kannada, CALTECH101, CIFAR10, and colonoscopy.]

PARSING THE GRAPHS: SOME SURPRISING RESULTS No dataset is qualitatively the most general. The MNIST dataset is the most specific. Rather, MNIST is one that is generalized very highly by all datasets at all layers. MNIST actually gives better accuracy when prejudiced with other datasets than with random initializations, or even when prejudiced with itself! This is a strong indicator that all datasets contain all the atomic structures of MNIST. English and digits are more general than Kannada! While MNIST and MNIST-rotated are not general, the MNIST variants with backgrounds, Google SVHN, NIST and Char74-English are all more general than Char74-Kannada.

INTER-CLASS DATASET GENERALITY D_i and D_j need not be entire datasets; they can also be disjoint class subsets of the same dataset. For instance, we divided the MNIST dataset into two parts: MNIST [4, 5, 8] (base) and MNIST [0, 1, 2, 3, 6, 7, 9] (retrain). We repeated this experiment several times with a decreasing number of training samples per class in the retrain dataset MNIST [0, 1, 2, 3, 6, 7, 9]; the testing set remained the same size. We created seven such datasets with 7p samples each, p in {1, 3, 5, 10, 20, 30, 50}.
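A sketch of how such a sub-sampled retrain set could be built with torchvision's MNIST; the class split [4, 5, 8] vs. the remaining seven digits follows the slides, but the sampling code itself is an illustrative assumption.

```python
# Sketch: split MNIST into a base set (digits 4, 5, 8) and a retrain set drawn
# from the other seven digits, keeping only p training samples per class.
import random
from collections import defaultdict
from torchvision import datasets

mnist = datasets.MNIST(root="./data", train=True, download=True)
base_classes = {4, 5, 8}
retrain_classes = set(range(10)) - base_classes

# Index the training set once, by class label.
per_class = defaultdict(list)
for i, y in enumerate(mnist.targets.tolist()):
    per_class[y].append(i)

base_idx = [i for y in base_classes for i in per_class[y]]

def subsampled_retrain_indices(p, seed=0):
    """Pick p samples per retrain class: 7p training samples in total."""
    rng = random.Random(seed)
    return [i for y in retrain_classes for i in rng.sample(per_class[y], p)]

# The seven retrain sets used on the slides: p in {1, 3, 5, 10, 20, 30, 50}.
retrain_splits = {p: subsampled_retrain_indices(p) for p in (1, 3, 5, 10, 20, 30, 50)}
```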

INTRA-CLASS GENERALITY - RESULTS Initializing from a network that was trained on only a small subset of well-chosen classes can significantly improve generalization performance on all classes, even if trained with arbitrarily few samples, and even in the extreme case of one-shot learning. [Plot: sub-sampled generalities against the number of layers frozen, for a randomly initialized network and for n = 1, 3, 5, 10, 20, 30, 50 samples per class. The corresponding accuracy table is reproduced on the next slide.]

INTRA-CLASS GENERALITY - RESULTS

p    base         k=0     k=1     k=2     k=3
1    Random       -       -       -       55.61
     MNIST[458]   73.07   73.91   76.37   77.52
3    Random       -       -       -       73.34
     MNIST[458]   83.61   87.2    85.7    87.6
5    Random       -       -       -       83.32
     MNIST[458]   90.98   92.98   92.6    92.07
10   Random       -       -       -       81.31
     MNIST[458]   91.55   93.71   93.82   95.08
20   Random       -       -       -       87.77
     MNIST[458]   95.52   95.52   97.07   96.78
30   Random       -       -       -       88.62
     MNIST[458]   96.5    97.34   97.35   97.45
50   Random       -       -       -       90.78
     MNIST[458]   96.38   97.40   97.71   97.38

Even with one sample per class, a 7-way classifier could achieve roughly 22 points more accuracy than a randomly initialized network. It is noteworthy that the last row of the table still uses about 100 times less data than the full dataset and already achieves close to state-of-the-art accuracy even when no layer is allowed to change.

MORE RESULTS Once initialized with a general enough subset of classes from within the same dataset, the generalities didn't vary across the layers. The more data we used, the more stable the generalities remained. If the classes are general enough, one may initialize the network with only those classes and then learn the rest of the dataset even with a very small number of samples.

This work was supported in part by ARO grant W911NF1410371. Fin.