Semantic Relatedness in Convolutional Neural Networks


Semantic Relatedness in Convolutional Neural Networks

Paul Missault

Supervisors: Prof. dr. ir. Filip De Turck, Dr. Femke Ongenae
Counsellors: Ir. Rein Houthooft, Dr. Stijn Verstichel

Master's dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering

Department of Information Technology
Chair: Prof. dr. ir. Daniël De Zutter
Faculty of Engineering and Architecture
Academic year


Semantic Relatedness in Convolutional Neural Networks

Paul Missault, Gent, Belgium. Supervisor(s): Filip De Turck, Femke Ongenae

Abstract — This article introduces the semantic cross entropy, a novel objective function with interesting properties. It will be shown that when this semantically aware objective function is used to train deep networks, the resulting output confidences are very well calibrated. When such a network predicts a class with a high confidence, there is in fact a high probability that the prediction is correct; conversely, a prediction with a low confidence has a high probability of being wrong. This is opposed to traditionally trained networks, which are typically extremely confident in all their predictions, regardless of the actual probability of being correct. Furthermore, with an appropriate choice of semantic relations among the labels of a dataset, the objective correlates the classification errors more strongly with the ground truth, i.e. the errors made by a classifier will be semantically closer to the ground truth. The impact of this novel objective is evaluated on Convolutional Neural Networks (CNNs).

Keywords — Semantics, Convolutional Neural Network, Deep Learning, Computer Vision

I. INTRODUCTION

The research within computer vision traditionally considers classification errors in a binary manner: a classification is correct or it is not. Furthermore, a classifier is always trained to achieve the highest possible confidence in the ground truth. Both these statements follow from the most widely used classification objective, the cross entropy, shown in Equation 1. In this equation (X_n, Y_n) is a datapoint from the training set (X, Y), and P(Y_n | X_n, θ) is therefore the classifier's confidence in the ground truth Y_n given classifier parameters θ and input X_n.

C(\theta, X, Y) = -\frac{1}{N} \sum_{n=0}^{N-1} \log\big(P(Y_n \mid X_n, \theta)\big)    (1)

A classifier trained to minimize this objective will push its confidence in the ground truth as close to 1 as possible for all datapoints in the training set. The output probabilities of these networks are therefore very spiked, a property that is not only present during training but that also holds at test time, as will be shown in what follows. This paper proposes the following novel objective function, the semantic cross entropy, built on the work of Zhao et al. [1]:

C(\theta, X, Y) = -\frac{1}{N} \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} S_{Y_n, L_m} \log\big(P(L_m \mid X_n, \theta)\big)    (2)

The values S_{i,j} describe the semantic relatedness of labels i and j. The inner sum of Equation 2 is therefore a sum over all possible labels, where the (logarithm of the) confidence in each label is weighed by that label's relatedness to the ground truth. If S is the identity (eye) matrix, the inner sum is only non-zero when L_m = Y_n, which reverts the semantic cross entropy back to the general cross entropy. Such a choice for S assumes a semantic relatedness where a concept is only related to itself, as S_{i,j} = 0 for i ≠ j. Perhaps a clearer way to think about the proposed function is that the value of S_{Y_n, L_m} allows a network to be somewhat confident in a label L_m that is not the ground truth Y_n, provided that label is semantically related to the ground truth.

II. CROSS ENTROPY GENERATES OVERLY CONFIDENT CLASSIFIERS

To show the claim that the outputs of classifiers trained with cross entropy are spiked even at test time, we built a network heavily inspired by the current best classifier on the CIFAR-100 dataset as designed by Clevert et al. [2].
This classifier is a CNN with the architecture described in Table I.

    layer          filters    filter size
    convolutional  384        3x3
    convolutional  384        1x1
    MaxPool                   2x2
    convolutional  384        1x1
    convolutional  480        3x3
    convolutional  480        3x3
    MaxPool                   2x2
    convolutional  480        1x1
    convolutional  520        3x3
    convolutional  520        3x3
    MaxPool                   2x2
    convolutional  540        1x1
    convolutional  560        3x3
    convolutional  560        3x3
    MaxPool                   2x2
    convolutional  560        1x1
    convolutional  600        3x3
    convolutional  600        3x3
    MaxPool                   2x2
    convolutional  600        1x1
    softmax        100

    TABLE I: Network architecture

The architecture described in Table I will be used throughout the rest of this paper. The activation function used in all layers except the final softmax is the Exponential Linear Unit (ELU), as described by Clevert et al. Dropout [3] is applied to the output of the last convolutional layer and to the output of every MaxPool above it, with respective dropout rates of [0.5, 0.4, 0.3, 0.2, 0.2, 0]. The size of the filters is slightly different from the sizes proposed in [2]: uneven filters can (with proper padding) preserve the size of the input, whereas the proposed 2x2 filter sizes require either complex padding schemes or systematic upsampling of the output.

This CNN was trained on CIFAR-100 [4], a dataset of 60,000 labeled tiny images with a fixed train-test split of 50,000 training and 10,000 test images. It was trained for 80 epochs using stochastic gradient descent with a batch size of 100. The initial learning rate was set at 0.01 and decayed by a factor 1/1.05 after every epoch. Momentum was used with a parameter of 0.9, as was L2 weight decay. This trained network was evaluated on the fixed set of 10,000 test images, on which it achieved an accuracy of 69.24%. Statistics of the prediction confidence are summarized in Table II.

    statistic    value
    mean         0.912
    median
    std

    TABLE II: Statistics of the prediction confidence on the test set

These statistics do not tell the full story. Despite being a clear indication that the prediction confidence is typically close to 1, we cannot decisively conclude whether or not this network is overly confident. It is still very well possible that this network is very confident in those predictions that are correct but has a lower confidence in all the predictions that ended up being wrong. To examine this further, all predictions were grouped by a threshold on their prediction confidence. For each of these groups the prediction accuracy is calculated once more, as summarized in Table III.

    threshold          # images    accuracy
    none               10,000      69.24%
    > 0.99             6,299       86%
    > mean (=0.912)
    < mean

    TABLE III: Accuracy for varying thresholds on prediction confidence

In an ideal case we would want the accuracy in each of the groups to be higher than their threshold, e.g. when we consider only predictions where the confidence is higher than 0.9, we would like at least 90% of those predictions to be correct. From Table III we see that this is not the case in a traditionally trained network. Out of all the predictions with a confidence higher than 0.99, only 86% are correct. Secondly, we also see that 6,299 of the images are predicted with a confidence higher than 0.99. This leads us to the conclusion that cross entropy generates overly confident networks. The confidence in a prediction is very hard to relate to the actual chance that the prediction is correct. This is especially an issue in a detection use case, as these networks will generate a lot of false positives ("The network has more than 0.99 confidence that it can find Waldo in this spot, therefore it surely is correct to assume there is a Waldo."). This is further explored and visualized in Figure 1.

Fig. 1: Actual accuracy versus confidence threshold for the network trained with cross entropy; each bar is annotated with the number of samples in the bin

Again we see that 7,632 out of 10,000 predictions have a confidence of at least 0.9, showing that cross entropy generates very confident networks. The resulting actual accuracy for each of these predictions is significantly lower than the prediction confidence. The network is therefore not only very confident, it is in fact overly confident.

III. SEMANTICS IN SEMANTIC CROSS ENTROPY

From the definition of the semantic cross entropy in Equation 2 we see that the relatedness among labels is governed by a matrix S. The introduction already mentioned that when this S matrix is an eye-matrix, a label is only related to itself, resulting in a semantic cross entropy that is equivalent to cross entropy.
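To make the role of S concrete, the following is a minimal sketch of the semantic cross entropy of Equation 2 for a batch of softmax outputs. It is an illustrative NumPy implementation, not the code used in the experiments; the array names and shapes are assumptions.

    import numpy as np

    def semantic_cross_entropy(probs, targets, S, eps=1e-12):
        """Semantic cross entropy of Equation 2.

        probs   -- (N, M) softmax outputs P(L_m | X_n, theta)
        targets -- (N,) integer ground-truth labels Y_n
        S       -- (M, M) relatedness matrix whose rows sum to 1
        """
        # weigh the log-confidence in every label by its relatedness
        # to the ground-truth label of each sample
        weights = S[targets]                     # rows S_{Y_n, .}, shape (N, M)
        log_probs = np.log(probs + eps)          # eps avoids log(0)
        return -np.mean(np.sum(weights * log_probs, axis=1))

    # With S equal to the identity matrix the expression reduces to the
    # ordinary cross entropy of Equation 1.
    probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])
    targets = np.array([0, 1])
    print(semantic_cross_entropy(probs, targets, np.eye(3)))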
Generally this matrix can be any symmetric matrix, as semantic relatedness is a symmetric property (i relates to j just as much as j relates to i). Secondly, the sum of the elements of a row (equivalently, column) should be constant. If we let the sum of row (column) i be greater than the sum of row (column) j, we effectively unbalance our objective in favor of label i. This behavior is often used in datasets where label i is underrepresented, but it should be avoided in the general case, as each label is equally important. The constant is assumed to be 1 without loss of generality.

Continuing on the work of Zhao et al. [1], we propose a general method to select a matrix S for any dataset of which the labels are structured in a hierarchy or an ontology. We begin by defining a matrix D where the element D_{i,j} is a quantitative measure of the hierarchical or ontological distance of labels i and j. Consider for instance the following D_{i,j} for labels structured in a hierarchic tree:

D_{i,j} = \frac{\mathrm{length}(path(i) \cap path(j))}{\max\big(\mathrm{length}(path(i)),\, \mathrm{length}(path(j))\big)}    (3)

Here path(i) is the path from the root node (the base class) down the hierarchy to class i, while length(p_1 \cap p_2) is the number of classes that are part of both paths p_1 and p_2.

This very general construction of D will work when the labels are structured in a hierarchical tree, but any hierarchical or ontological distance measure can be used to construct D. From the matrix D we then construct S as:

S_{i,j} = \frac{1}{Z} \exp\big(-\kappa (1 - D_{i,j})\big)    (4)

where κ is a hyper-parameter that governs the decay of the relatedness and Z is a normalization constant such that the rows of S sum to 1. The impact of κ is explored in Figure 2, in which D is built according to Equation 3 using the labels of CIFAR-100, structured hierarchically according to Figure 7.

Fig. 2: Impact of κ on the S matrix

Studying the diagonal of the S matrices can help select an appropriate κ. For κ = 2 we see that the elements on the diagonal are approximately 0.06. Such a choice of κ would result in an objective function that depends for only 6% on the confidence in the ground truth. Inversely, this means 94% of the objective is determined by the confidence in related labels, which is typically not what we want. Selecting κ = 8 results in an S matrix with 0.9 on the diagonal. Such an S matrix will have more use cases, as the objective is still dominated by the ground truth but the other 10% of the objective will depend on how confident the network is in semantically related labels.

A second approach can be taken where we neglect the actual semantics of a dataset and superimpose a very simple, uniform relatedness. For a dataset of N possible labels we impose that a label is related to itself with strength α and related to the others with strength (1 − α)/(N − 1). On a dataset of 100 labels such as CIFAR-100 and with α = 0.9, this translates to an S matrix with 0.9 on the diagonal and (1 − 0.9)/99 ≈ 0.001 off the diagonal. This choice of S shall henceforth be called the uniform S.
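The two constructions of S discussed above can be summarised in a short sketch: the hierarchical S of Equations 3 and 4 on a toy three-label hierarchy, and the uniform S for comparison. The hierarchy, the value of κ and the helper names are illustrative assumptions, not the CIFAR-100 setup evaluated in the next section.

    import numpy as np

    # Toy hierarchy: every label is described by its path from the root.
    paths = {
        0: ["root", "aquatic animals", "beaver"],
        1: ["root", "aquatic animals", "seal"],
        2: ["root", "vehicles", "bus"],
    }

    def relatedness_matrix(paths, kappa):
        M = len(paths)
        D = np.zeros((M, M))
        for i in paths:
            for j in paths:
                shared = len(set(paths[i]) & set(paths[j]))           # length(path(i) ∩ path(j))
                D[i, j] = shared / max(len(paths[i]), len(paths[j]))  # Equation 3
        S = np.exp(-kappa * (1.0 - D))                                # Equation 4, before normalization
        return S / S.sum(axis=1, keepdims=True)                       # Z makes every row sum to 1

    S_semantic = relatedness_matrix(paths, kappa=8)

    # Uniform alternative: alpha on the diagonal, (1 - alpha)/(M - 1) everywhere else.
    M, alpha = 3, 0.9
    S_uniform = np.full((M, M), (1 - alpha) / (M - 1))
    np.fill_diagonal(S_uniform, alpha)

    print(np.round(S_semantic, 3))   # the diagonal dominates; off-diagonal mass goes to close labels
    print(S_uniform)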

IV. IMPACT OF SEMANTIC CROSS ENTROPY

To research the impact of semantic cross entropy we evaluate its performance on CIFAR-100, using the same network architecture and training parameters as discussed in Section II. We train this architecture three times, with three different choices of S. The first is a semantic choice of S with κ = 8, where D is built according to Equation 3 using the CIFAR hierarchy depicted in Figure 7. The non-semantic choice of S is that where S is an eye-matrix; as then S_{i,j} = 0 for i ≠ j, the semantic cross entropy reverts to cross entropy, and the network trained with this choice of S is therefore completely equivalent to the network discussed in Section II trained with cross entropy. The third choice is the uniform S with α = 0.9. The test error of these networks over 80 training epochs is shown in Figures 3 and 4.

Fig. 3: Test error rate over 80 epochs for the semantic, uniform and non-semantic networks
Fig. 4: Test error rate for the last 30 epochs

                        non-semantic   semantic   uniform
    highest accuracy    69.24%         68.38%     69.98%
    lowest error        30.76%         31.62%     30.02%

    TABLE IV: Lowest error, or equivalently highest accuracy, for the three networks

The results from Figure 3, summarized in Table IV, are quite surprising. Not only does the introduction of a uniform S cause the network to train faster, it also lowers the error by 0.74%, or equivalently increases the accuracy by 0.74%. The prediction confidences of the semantic and uniform networks are analyzed in the same way as we already did for the non-semantic one in Section II; the results are shown in Figures 5 and 6.

Fig. 5: Accuracy for varying confidence thresholds with the semantic S; each bar is annotated with the number of samples in the bin
Fig. 6: Accuracy for varying confidence thresholds with the uniform S; each bar is annotated with the number of samples in the bin

These results show that the confidence of a network trained with semantic cross entropy is more tightly bound to the actual chance that the predicted label is correct. Whereas the prediction confidence of a network trained with the non-semantic cross entropy is typically high, that confidence does not properly reflect whether we can actually trust the network's prediction. Should a network trained with semantic cross entropy exhibit a high confidence, we can be quite sure that the prediction is in fact correct. We will informally refer to this well calibrated prediction confidence as the trustworthiness of the network. The impact semantic cross entropy has on the trustworthiness of a network seems to be similar for both the uniform and the semantically inspired S. This remarkable quality means that trustworthy predictions can be generated for arbitrary datasets, regardless of the semantic relatedness of the labels in that dataset; we only have to decide on a fitting choice for α in S.

An additional benefit of the semantically inspired S matrix becomes apparent when we evaluate the errors made by the three networks. More specifically, we are interested in the parent accuracy, the percentage of errors that still have the same parent in the hierarchy of Figure 7 as the ground truth. E.g. if the ground truth label of an input is beaver, a prediction of seal would be an error, but as they both have the parent label aquatic animals they are parent accurate.

                        non-semantic   semantic   uniform
    parent accuracy     28.13%         35.46%     33.59%

    TABLE V: Parent accuracy evaluated on the errors made by all three networks

Table V reveals that a network trained using semantic cross entropy with a semantically inspired S makes errors that are semantically related to the ground truth. Should such a network make an error, there is still a 35.46% chance that the error has the same hierarchical parent as the ground truth. Remarkably, the network trained with a uniform S also outperforms the non-semantic network on this metric, despite having no knowledge of the hierarchy.

V. CONCLUSIONS

On the CIFAR-100 dataset a network trained using semantic cross entropy with a uniform S achieved an accuracy of 69.98%, while a non-semantically trained network reached 69.24%. Whether or not this improvement can consistently be observed is left to future work, but it is a clear indication that semantic cross entropy with a uniform S can be introduced without lowering the overall accuracy. The network with a semantically inspired S reached an accuracy of 68.38%, but of all its errors 35.46% still have the same hierarchic parent as the ground truth. This is only true for 28.13% of the errors made by the non-semantic network. Arguably the most beneficial aspect of semantic cross entropy is the calibration of the output confidences.
Both the networks trained with the semantically inspired S and the uniform S generate predictions of which the confidence is closely related to the probability that the prediction is correct. Provided this property can be consistently shown on other datasets, such networks could have many use cases. Most noticeably, these networks would drastically decrease the number of false positives in a detection task.

REFERENCES

[1] Bin Zhao, Li Fei-Fei, and Eric P. Xing, "Large-Scale Category Structure Aware Image Categorization," Advances in Neural Information Processing Systems 24 (Proceedings of NIPS), pp. 1-9, 2011.
[2] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," under review at ICLR 2016, pp. 1-13.
[3] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint, pp. 1-18, 2012.
[4] Alex Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," Computer Science Department, University of Toronto, Tech. Rep., pp. 1-60, 2009.

Fig. 7: Hierarchy of labels in CIFAR-100


Acknowledgements

A big thanks goes out to Kristof and Rein for their efforts in proofreading this thesis. Their insightful comments have shaped many sections and paragraphs. I'd like to thank Brecht, Femke and Stijn for their role in providing me with the necessary hardware both before and after the relocation of the iminds Virtual Wall. Femke and Stijn have also tremendously helped me with the final structure of the chapters, which has led to what I believe is a sequential story. Special thanks goes out to Kristof and Tom, who personally helped me deal with floods due to heavy rains in the last few days of writing. Without their selfless efforts I wouldn't have been able to get my final revisions done in time.


Contents

1  Introduction
2  Convolutional Neural Networks
     Artificial Neural Networks
     Neural networks and supervised learning
     A brief history of Convolutional Neural Networks for computer vision
     The architecture of Convolutional Neural Networks
         The convolutional layer
         The pooling layer
         Activation Function
     Training Convolutional Neural Networks
         Objective Functions
         Gradient Descent
         Backpropagation
     Getting more out of Convolutional Neural Networks
         Stochastic gradient descent
         Data Augmentation
     Regularizing Convolutional Neural Networks
         The Convolutional Layer
         Dropout
3  Objective Function: Design and Adaptation
     Discriminative Categories
     Hedging Your Bets
     Semantic cost
     Semantic Cross Entropy
         Semantic relatedness matrix
         Example use
     Research questions

4  Research Strategy
     Hardware Setup
     Dataset selection
         Belgian Traffic Signs
         ImageNet
         CIFAR
     Network Architecture and Training Specification
     Overcoming numerical instability
     Tier-n Accuracy
5  Evaluation and Results
     Selecting κ
     Network Accuracy
     Network Confidence
     Semantic cross entropy as regularisation
6  Conclusion and Future Work
     Conclusions
     Future Work
         Does semantic cross entropy with a uniform S consistently outperform non-semantic cross entropy in terms of accuracy?
         Semantic Cross Entropy in Detection
         Erroneous Truth Labelling
         Evaluation on Complex Datasets

List of Figures

2.1   ANN with 1 hidden layer
2.2   Convolution taken over an image
2.3   Gabor filters
2.4   Example result of convolving the image on the left with a set of Gabor filters
2.5   First layer filters trained on natural images
2.6   Locally connected neurons
2.7   Stride of filters
2.8   Effects of a 2-by-2 maxpool layer
2.9   Commonly used activation functions
2.10  ReLU(x) = max(0, x)
2.11  Variants on the ReLU with α1 = 0.1 and α2 = 1
2.12  Saddle point
3.1   Image of a rottweiler
4.1   Example from the traffic sign dataset
4.2   Example from ImageNet. A few of its labels are Panda, Giant Panda, Mammal and Vertebrate
4.3   Example from CIFAR, 32x32 tiny color image labelled as dog
4.4   Tree of related labels in CIFAR
5.1   Visualization of the effect of κ on the S matrix
5.2   Error rates of finetuning a network with various values for κ
5.3   Zoom on the last epochs of Figure 5.2
5.4   Test error rate shown for 80 epochs
5.5   Test error rate for the last 30 epochs
5.6   Non-semantic cross entropy for 80 epochs
5.7   Accuracy for specific confidence intervals for semantically trained network
5.8   Accuracy for specific confidence intervals for non-semantically trained network
5.9   Test error rate shown for 80 epochs
5.10  Test error rate for the last 30 epochs

5.11  Accuracy for specific confidence intervals for the network trained with uniform S

List of Tables

3.1  Prediction results of rottweiler image: semantic and non-semantic cross entropy
4.1  Hardware specifications
4.2  Network architecture
5.1  Lowest error or equivalently highest accuracy for both networks
5.2  Tier-1 accuracy evaluated on tier-0 errors made by both networks
5.3  Tier-1 accuracy evaluated on all predictions made by both networks
5.4  Statistics of the highest confidence on each image in the testset
5.5  Accuracy for varying thresholds with semantic cross entropy
5.6  Accuracy for varying thresholds with non-semantic cross entropy
5.7  Lowest error or equivalently highest accuracy for three networks
5.8  Tier-1 accuracy evaluated on tier-0 errors made by all three networks


Acronyms

ANN    Artificial Neural Network
CNN    Convolutional Neural Network
DNN    Deep Neural Network
ELU    Exponential Linear Unit
LReLU  Leaky Rectified Linear Unit
MSE    Mean Squared Error
ReLU   Rectified Linear Unit
SVM    Support Vector Machine


Chapter 1

Introduction

Research concerning deep learning has become increasingly popular over the past years. In the year I took to write this thesis, Google beat the world champion at Go by revolutionizing reinforcement learning, NVIDIA taught a computer how to drive a car using only Convolutional Neural Networks (CNNs) and both Microsoft and Apple considerably advanced speech recognition. The number of papers published on every aspect of deep learning is quite overwhelming and leads to quotes like "... to our knowledge we are the first to ..." in papers by even the most highly regarded research groups. In order to fully define where this thesis fits in the whole deep learning landscape, I will first attempt to define deep learning.

Deep learning is a subset of machine learning algorithms that consist of several modules that extract features from the data. Each module uses the output of the previous module in a hierarchical way, resulting in high-level features of the input data. These features can then be used by subsequent modules to perform the required task. In the case of Deep Neural Networks (DNNs) every such module is a set of neurons, which we call a layer. The distinction between traditional Artificial Neural Networks (ANNs) and a DNN is not defined by the number of layers, but rather by the role of these layers. Every layer in a DNN transforms the output of the previous layer into some higher level features of the data, whereas a layer in an ANN is typically regarded to perform a (possibly complex) task on its input.

Perhaps the most widespread type of DNN is the CNN. These CNNs are highly structured and have arguably had the most impact in deep learning to date. The true power of CNNs is most clear in the field of computer vision. This power has only recently been fully discovered due to the increased availability of high-resolution image datasets and increases in both computational power and computationally efficient libraries.

For example, the ImageNet Large Scale Visual Recognition Challenge 2012 required classifying 150,000 images, using over 10 million labeled high-resolution images as training data, in more than 1,000 categories. This challenge was won by using a CNN achieving a top-5 error rate of 16.4% [1]. Even more remarkable is the fact that the second placed classifier achieved a top-5 error rate of 26.1% by using an ensemble of classifiers based on traditional image features such as SIFT, HOG, LBP, etc. This dominance of CNNs on large, high-resolution datasets makes them a very promising technique to solve complex computer vision tasks.

Upon further researching the role of CNNs in these computer vision tasks, I noticed that most of the tasks involve some form of hierarchy. If we aim to recognize traffic signs, the different types of signs can be structured hierarchically, e.g. all the different danger signs can be considered hierarchical descendants of a single parent (an abstract danger sign). Despite hierarchy being prevalent in the labels, it is rarely taken into account during training. Deep learning research aimed at computer vision concerns itself with finding algorithms that make as few errors as possible, but it generally does not care which errors it does end up making. This thesis will attempt to further research this issue and will propose a solution.

Chapter 2 handles CNNs and in what way they differ from ANNs. The chapter assumes the reader is familiar with the theories around machine learning and ANNs. It is by no means meant to be a complete explanation of CNNs, but it will help the reader understand the rest of the thesis. Readers with extensive knowledge on CNNs or DNNs can skip this chapter. Chapter 3 sketches existing research on hierarchy in machine learning. From this, research questions are posed, and a new technique to handle hierarchy is proposed. Chapter 4 provides the reader with the necessary information about the experiments run in order to evaluate the proposed technique. It explains in detail the final network architecture on which the technique is evaluated and the hardware on which it was run. Chapter 5 evaluates the results achieved in Chapter 4. The reader can find a full discussion on the impact of the proposed technique, and how it compares to techniques currently in use by the deep learning community. Finally, Chapter 6 will present an overall conclusion. The evaluation from Chapter 5 is summarized and the most notable results are discussed.

The end of the chapter contains a listing of work that follows from this thesis.


Chapter 2

Convolutional Neural Networks

2.1 Artificial Neural Networks

Figure 2.1: ANN with 1 hidden layer

ANNs are biologically inspired models that pass information between neurons in order to perform complex tasks. Each neuron produces a numerical output that can serve as an input to any number of other neurons. A set of neurons that all draw inputs from the same set of neurons is called a layer. The layer whose neurons do not serve as inputs to subsequent layers is the output layer, as the outputs of these neurons are considered to be the outputs of the network. The layer whose neurons have no inputs is the input layer; its neurons can be used to give a certain value as input to the network. Every layer between the input and output will never be directly observed and is subsequently called a hidden layer.

Every connection between neurons has a specific weight associated with it. A neuron multiplies every one of its inputs with the associated weight before taking the sum of all these weighted inputs. To this linear combination of inputs and weights we then apply an activation function. The choice of this activation function turns out to be important and is reviewed in Section 2.4.3. It is important to note that this function is not necessarily linear. In fact, if it were linear, any network would be reducible to an equivalent network with only an input and an output layer. You can convince yourself of this fact by considering that if the activation function is linear in its input, the neuron is linear in its inputs. Any subsequent layer will then again take a linear combination of previously calculated linear combinations, which is in itself again a linear combination of the same inputs.

The output of a neuron can be seen in Equation 2.1. Here y is the output of a neuron with N inputs x_i, with which weights w_i are associated. For notational compactness we rephrase the linear combination as a dot product. The activation function a introduces the non-linearity of the model.

y = a\left(\sum_{i=0}^{N} w_i x_i\right) = a(\vec{w} \cdot \vec{x})    (2.1)

The layers in an ANN are fully connected. This means that every neuron in a layer is connected to every neuron of the previous layer, or equivalently, all neurons in a layer have the same inputs. It is tempting to think all the neurons in a layer are therefore equivalent. One should however note that there is a lot of freedom in selecting the associated weights for every neuron. Rather than evaluating the output of every neuron individually, a computationally efficient way to evaluate every neuron of a layer is introduced. Instead of evaluating the dot product for every neuron, we structure the weights w_i such that they are the columns of a matrix W. As all neurons have the same inputs we can then simultaneously calculate the output of every neuron in a layer and introduce the recurrent expression to evaluate an ANN shown in Equation 2.2.

X_{i+1} = a(X_i W)    (2.2)
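As an illustration of Equations 2.1 and 2.2, the sketch below evaluates a small fully connected network, layer by layer, as a sequence of matrix products. It is a minimal NumPy example with made-up weights, not code from this thesis, and it omits bias terms just like Equation 2.1.

    import numpy as np

    def layer(X, W, activation):
        # Equation 2.2: all neurons in a layer share the same inputs, so the
        # whole layer is one matrix product followed by the activation function.
        return activation(X @ W)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 3))          # 4 samples with 3 input neurons each
    W1 = rng.normal(size=(3, 5))         # weights of a hidden layer with 5 neurons
    W2 = rng.normal(size=(5, 2))         # weights of an output layer with 2 neurons

    hidden = layer(X, W1, np.tanh)
    output = layer(hidden, W2, lambda z: z)   # linear output, e.g. for a regression task
    print(output.shape)                       # (4, 2)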

2.2 Neural networks and supervised learning

Supervised learning generally entails two distinct tasks. Regression is the task in which we attempt to approximate a function based on a limited set of (noisy) samples from that function. To perform regression with an ANN it suffices to construct an ANN with the number of input neurons equal to the dimensionality of the function. Despite functions having only one output by definition, we often generalize this definition by allowing ANNs with more than one output neuron to evaluate a function with more outputs. The ANN depicted in Figure 2.1 could be used to evaluate a three dimensional function with two outputs.

The second task, classification, aims to detect distinct patterns in the input in order to predict how to label this input. A typical example here is the task of labelling handwritten digits (which are inherently noisy) with the intended digit. The desired output of an ANN tasked with a classification job is a set of probabilities, one for each possible classification label. In the example of the handwritten digits, we would like the output to be a set of 10 probabilities (as there are 10 digits) where each probability reflects the confidence that the given input is the respective digit. We will therefore name each output probability the confidence of the network in the associated label. We cannot assume that the output of an ANN behaves like a probability distribution (i.e. the sum of the outputs is 1, and each output lies in [0, 1]); therefore we add a transformation at the very end that turns the output of the network into a probability distribution. This transformation is called the softmax, which can be read in Equation 2.3, where x_i is the output of a neuron i and y_i the associated confidence.

y_i = \frac{e^{x_i}}{\sum_{j=0}^{N} e^{x_j}}    (2.3)

For consistency the softmax expression is implemented as another layer in the ANN. This does not go against our current definition of a layer, as it is merely a non-linear function of the outputs in the previous layer (albeit a rather complex function). It is important to note that, unlike the other layers in an ANN, the softmax layer is completely defined by its inputs, without associated weights, and can therefore not be shaped at will. It is merely a transformation of the network output into a probability distribution.

2.3 A brief history of Convolutional Neural Networks for computer vision

CNNs are a type of ANN in which the layers are highly structured. They can be considered as an ANN of typically many layers in which the layers are heavily constrained and structured.

The nature of this structure will be explained in detail in Section 2.4. CNNs gained popularity when they were found to perform well on visual data by LeCun et al. in 1998 [2]. Their performance is accredited to the multilayer architecture that allows CNNs to solve complex tasks, while the structure in each layer leads to better scaling behavior as opposed to general ANNs.

Up until the work of LeCun, image recognition tasks were tackled by using manually designed feature extractors. These extracted features were then fed through a classifier such as a Support Vector Machine (SVM) or an ANN to perform classification. It was the goal of LeCun et al. to come up with a multilayer neural network that could take raw pixel data as an input instead of manually designed features. Such a neural network should transform the raw data into appropriate features in the first few layers and then perform classification of these features in the subsequent layers. The most important philosophy behind the workings of a CNN is that the feature extraction is part of the learning process. Therefore it is no longer explicitly required to perform manual feature extraction. This does not mean that CNNs are an excuse to not look at the data, as we still have to find suitable hyperparameters for the new concepts that we will introduce. The choice of these parameters is highly dependent on the task at hand, but surprising results have been reached by using off-the-shelf CNNs or even randomly generated ones [3][4]. A detailed discussion on these cases can be found in Section 2.7.

Today CNNs are regarded as one of the most powerful techniques to perform computer vision tasks, but it took until 2012 for CNNs to outperform any other classifier on the highly competitive MNIST dataset [5]. This is mainly due to the fact that MNIST is a low-resolution dataset that requires classification into only 10 labels, which makes it possible to manually design very good features. On the very large 1000-category ImageNet dataset this manual feature extraction becomes far less feasible, which is where CNNs show their dominance by outperforming the best non-CNN classifier by 10% top-5 error [1].

A recent improvement in CNNs is the use of GPUs. Efficient usage of GPUs has been found to speed up the training of a CNN by 2-3 times [5]. This alleviated some practical boundaries on how long a researcher was willing to train a CNN in order to obtain good results.

Current prior art has also invested research in understanding and visualizing the weights of a fully trained CNN. It is a known difficulty that the weights of ANNs are difficult to interpret, but this difficulty has typically not been seen as a drawback when it comes to the performance of ANNs.

Many of the weights in a CNN, which are explained in Section 2.4.1, do not share this problem, and this gives rise to the idea that a thorough understanding of what is learned in a CNN can help design more accurate architectures [6].

2.4 The architecture of Convolutional Neural Networks

Keeping in mind that we wish to achieve a network that takes raw pixels as input, we can quickly see that fully connected layers do not scale favorably in the number of input pixels. Consider small 32x32 color images. Such images have 32 x 32 x 3 = 3,072 pixel values, and as we want a network that takes raw pixel data as input we would therefore need 3,072 input neurons. Let's say we get optimal results using 1,000 hidden neurons and we want to classify these images into 100 categories. A simple 2-layer network already has 3,072 x 1,000 + 1,000 x 100 = 3,172,000 weights to learn. Clearly, if we want to scale this network to deal with higher resolution input, we need to find a way to reduce the number of learnable weights.

This problem is solved in CNNs by implementing two new types of layers, the convolutional layer and the pooling layer. A complete CNN consists of two parts, a feature extraction part and a classification part. The feature extraction part contains any number of these new convolutional and pooling layers, while the classification part uses the traditional fully connected layers. The first part transforms the input into a set of features which the second part can then use to classify the input. We will show that the new layers are fundamentally no different from the fully connected layers of ANNs. Therefore we can train CNNs using the same supervised training methods we would use in an ANN. The most common method, the backpropagation algorithm, is discussed in Section 2.5.3.

To aid terminology in what follows, we discuss an image in terms of width, height and depth. The width and height are the spatial dimensions of the image, while depth can be seen as the information channels of the image. For example, a typical internet thumbnail has a width and height of 200 pixels with a depth of three, the RGB channels. A grayscale image has only a depth of one, while a PNG image can feature the transparency and color of a pixel, resulting in a depth of four. It is important to note that while in these examples the depth has an intuitive meaning such as color, grayscale or transparency, this is not necessarily always the case. In the terminology of CNNs it is not uncommon for a transformed image to have a depth of 100 or more, without each of these 100 channels having an intuitively clear meaning such as color or transparency. These outputs are generally called feature maps, while a channel is referred to as a depth slice.

2.4.1 The convolutional layer

The key difference between a CNN and an ANN is the introduction of the convolutional layer. We aim to introduce a filter in this layer that scans a small region of the image along the full depth, and to measure the similarity of the filter with each region in the image. In a more formal wording, we will introduce a discrete mathematical convolution of the image and a filter of fixed size. This filter is small in width and height, but is as deep as the image. Consider O to be the output of convolving the input I with filter F. I and F are both 3-dimensional matrices with each dimension as defined in Section 2.4, namely two spatial dimensions and one depth dimension. D denotes the depth of the input, and therefore also the depth of the filter, while W and H respectively denote the width and the height of the filter. We can then formally define the output O as:

O_{x,y} = \sum_{d=0}^{D-1} (F_d * I_d)_{x,y} = \sum_{d=0}^{D-1} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} F_{i,j,d}\, I_{x+i,\, y+j,\, d}    (2.4)

A visual example is given in Figure 2.2, where the input image has a width of 5, a height of 1 and a depth of 3. We see in Figure 2.2 that the resulting output decreases in width.

Figure 2.2: Convolution taken over an image.

In practice we pad the input to a convolutional layer with 0's along the edges in such a way that the result of the convolution has the same width and height as the input. We do so for two reasons. Firstly, ensuring that the output of a convolution has the same spatial dimensions as the input can help alleviate issues when designing a CNN, as we can determine the size of the output of an arbitrary layer in the network without much consideration. Secondly, padding the input with zeros prevents information at the edges from being under-valued. If a filter represents a particular pattern that is present in the image but cut off by the edge, the pattern will be overlooked by that filter. Padding with zeros allows the filter to better match the cut-off pattern and therefore better preserves information at the edges of an input.
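The triple sum in Equation 2.4 translates directly into code. The following is a naive NumPy sketch for a single filter, written for clarity rather than speed; the sizes are arbitrary assumptions, and real frameworks use heavily optimised implementations instead.

    import numpy as np

    def convolve2d(image, filt):
        """Naive implementation of Equation 2.4 for one filter.

        image -- array of shape (height, width, depth)
        filt  -- array of shape (H, W, depth), same depth as the image
        """
        H, W, D = filt.shape
        out_h = image.shape[0] - H + 1
        out_w = image.shape[1] - W + 1
        out = np.zeros((out_h, out_w))
        for y in range(out_h):
            for x in range(out_w):
                # element-wise product of the filter with one image region,
                # summed over both spatial dimensions and the full depth
                region = image[y:y + H, x:x + W, :]
                out[y, x] = np.sum(region * filt)
        return out

    image = np.random.rand(32, 32, 3)      # e.g. a small RGB image
    filt = np.random.rand(3, 3, 3)         # a 3x3 filter spanning the full depth
    print(convolve2d(image, filt).shape)   # (30, 30): the output shrinks without padding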

Before continuing with how a convolutional layer works, we discuss why it works. The idea of convolving filters over an image is not new. Gabor filters are used to perform edge detection in images, and involve taking the convolution of certain handmade filters with natural images. You can find these Gabor filters in Figure 2.3. Each of these filters represents an edge along a specific angle, and the convolution of an image with such a Gabor filter results in an output map where edges are strongly visible, as can be seen in Figure 2.4.

Figure 2.3: Gabor filters

Now remember that we aim to use the first layers to extract features from an image. If we compare the filters from the convolutional layer taken from a fully trained CNN in Figure 2.5 to the Gabor filters in Figure 2.3, we can definitely spot similarities.

Figure 2.4: Example result of convolving the image on the left with a set of Gabor filters.

The trained CNN has learned that these filters generate a useful output, and we can see this to be true as they are similar to the Gabor filters (which are manually designed to be ideal filters to detect edges).

Figure 2.5: First layer filters trained on natural images

This comparison leads us to the conclusion that training a convolutional layer makes it search for the most informative local characteristics of a set of images. In Figure 2.5 we see that a lot of these filters are Gabor-like edge detectors, but there are also checkerboard-like patterns and patterns that appear to be wavelets. Next to distinguishable shapes we can also see so-called color blob detectors, which can detect the presence of a certain color in regions of the image.

The next question we can ask ourselves is how we can efficiently apply this mathematical convolution in such a way that the resulting operation does not differ fundamentally from a normal fully connected layer. The answer lies in the local connectivity of neurons and weight sharing. As can be seen in Figure 2.2, the output of a convolution at a given point can be calculated from a small region of the input along the full depth. This gives rise to the idea that the output of this convolution can equivalently be seen as a neuron of which the inputs are exactly the pixel values of that region. This neuron is no longer connected to the full input; rather it is connected to a small region that holds all the information it needs. This property is named local connectivity, as a neuron is no longer fully connected to all neurons of the previous layer, but only to a small region of it.

Figure 2.6: Locally connected neurons

The second insight that makes convolutional layers tick is the weight sharing. When we consider Equation 2.4, which describes the convolution, we can notice two important things. Firstly, the convolution is nothing more than a linear combination of a region of the input I and the filter F. Secondly, this filter F does not depend on the values of x and y, the point at which we evaluate the convolution. In the equivalent network of Figure 2.6 this means that all the neurons have exactly the same weights, as they represent the output of a convolution with exactly the same filter. Together with the local connectivity, weight sharing means that a convolutional layer scales incredibly well. Whereas fully connected layers scale proportionally to the square of the size of their input, a convolutional layer scales with the size of its filter. This filter is typically small, meaning a convolutional layer only holds a handful of weights.

These concepts help us understand why CNNs work so well on computer vision tasks. The convolutional filters are used to detect patterns in the input, but each filter is identical wherever it is applied in the input. This spatial invariance of the filter means that the same pattern is filtered out of the input wherever it appears, which is very similar to how we as humans perform visual tasks. We notice distinguishing features of an object, no matter where it lies in our visual field.

We limited ourselves to a single filter in the previous discussion, but in most cases not all necessary information can be extracted by using a single filter. The generalisation to more filters can be done by creating arbitrarily many independent convolutional filters and stacking their outputs as depth slices in a feature map. This means that the output of a convolutional layer is again an image of fixed width and height with any number of depth slices, albeit these depth slices do not carry the same intuitive information (like color or transparency) as they did in the input image. It does mean, however, that the output of a convolutional layer can serve as the input to the next convolutional layer.

Figure 2.7: Stride of filters.

Next to the size of the filters, the number of filters and the padding of the image, we also have to decide on the stride of each filter. In Figure 2.7 a stride of one can be seen on the left and a stride of two on the right. The stride determines the overlap of filters during the convolution. Typically a stride of 1 is used, but a higher stride is not uncommon for larger images, as it decreases the size of the output at the cost of accuracy.

2.4.2 The pooling layer

The convolutional layer vastly lowers the number of learnable weights compared to a fully connected layer, due to the weight sharing and the local connectivity. But the resulting output of these layers is not significantly smaller in width and height than the input image. In terms of depth the output can even increase due to using many filters. Therefore we have merely shifted the scalability problems to the classification layers. The proposed solution is the use of pooling layers, which subsample the output considerably. Much like the convolutional layer, a pooling layer is connected to a local region of the input image, and its output can be a variety of subsampling operations such as the mean or the maximum of the region. A layer which performs the maximum operation over a region is depicted in Figure 2.8 and is typically called the Maxpool layer.

Figure 2.8: Effects of a 2-by-2 maxpool layer.

Prior art has empirically shown that, of the possible subsampling operations, the Maxpool layer significantly outperforms others such as the mean [7]. The Maxpool layer also increases the spatial invariance of a CNN, as the pooling operation summarizes the prevalence of a feature in a certain local region.
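A 2-by-2 maxpool with stride 2, as in Figure 2.8, is easy to express directly. The sketch below is an illustrative NumPy version for a single feature map; it is not tied to any particular framework and assumes the map fits in memory as one array.

    import numpy as np

    def maxpool2x2(feature_map):
        """2-by-2 max pooling with stride 2 on one feature map of shape (H, W)."""
        h, w = feature_map.shape
        h, w = h - h % 2, w - w % 2            # drop an odd trailing row/column if present
        blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
        # take the maximum over each 2x2 block; width and height are halved
        return blocks.max(axis=(1, 3))

    fmap = np.arange(16, dtype=float).reshape(4, 4)
    print(maxpool2x2(fmap))
    # [[ 5.  7.]
    #  [13. 15.]]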

An interesting side note is that, despite significant improvements due to pooling layers, some researchers are skeptical towards their use. It is conjectured that the pooling operation filters out a lot of useful spatial information. Most notably, Geoff Hinton himself can be quoted as follows: "The pooling operation used in CNNs is a big mistake and the fact that it works so well is a disaster." [8] Regardless of this powerful quote by one of the most influential researchers in machine learning, pooling layers allow us to tackle very complex problems by dramatically reducing the number of features extracted by the feature extracting layers.

2.4.3 Activation Function

The activation functions traditionally used in ANNs are the sigmoid and tanh functions, which are depicted in Figure 2.9 [9]. These functions suffer from two major drawbacks. For one, since they are applied to every neuron in a layer, they can be quite costly to calculate due to the exponentials and the division. Secondly, they suffer from what is called the Vanishing Gradient problem [10]. We will discuss this at length in Section 2.5.3, but for now assume a problem occurs due to the fact that the derivative of these functions is approximately zero when |x| becomes large.

Figure 2.9: Commonly used activation functions: (a) \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, (b) \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}

Rectified Linear Units (ReLUs) were first conceived in the context of Restricted Boltzmann machines [11], but they have been shown to not only greatly speed up training but also improve the results of CNNs [1]. For these reasons the ReLU and the activation functions derived from it are the activation functions of choice, and they have almost completely pushed out the use of the sigmoid and tanh. The reason for their superior performance is often accredited to their constant derivative, which alleviates training issues in deep networks (this will be discussed in depth in Section 2.5.3), and to their computationally efficient expression.

The ReLU expression can be seen in Figure 2.10.

Figure 2.10: ReLU(x) = max(0, x)

The ReLU activation function effectively filters out all negative values. This property leads to what is called sparse coding, a biologically inspired phenomenon. The brain of mammals consists of billions of neurons, but only a fraction of those neurons are active at the same time when processing data. Using the ReLU activation function enforces this effect on any type of ANN. It has been argued that such sparse codes are more efficient (than non-sparse ones) in an information-theoretic sense [12], and that this property of ReLUs is another reason for their superior performance.

The previous paragraph also defined a key limitation of ReLUs. To understand this, I argue the following. For a given input pattern only a small fraction of the total number of neurons will activate. If we wish to adapt the behavior of all the neurons based on their current behavior, we can only reliably adapt the small fraction of neurons that are active at that time. More formally, we see that the derivative of the output of a ReLU with respect to the input is 0 if the input is less than zero. This severely limits the flow of information (as a zero derivative means no small change in the input is reflected in the output), and the zero derivative will prove to be a significant hurdle when training neurons using the ReLU activation function. A few attempts have been made to tweak ReLUs so that they overcome this restricted flow of information, in the form of Leaky Rectified Linear Units (LReLUs) and Exponential Linear Units (ELUs).

Figure 2.11: Variants on the ReLU with α_1 = 0.1 and α_2 = 1:
(a) LReLU(x) = x if x > 0, α_1 x if x ≤ 0
(b) ELU(x) = x if x > 0, α_2 (exp(x) − 1) if x ≤ 0

Both the LReLU and ELU activation functions allow non-zero derivatives when the inputs are negative. LReLUs have been shown to perform equally to ReLUs while speeding up the training process [11]. ELUs are a recent discovery that have significantly improved CNNs on a competitive academic dataset [13]. It is quite interesting to note that despite a computationally heavy expression, ELUs still speed up training, as they reduce the number of training steps. They are however so recent that a full discussion and comparison is yet to be made.
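The three rectifier variants of Figures 2.10 and 2.11 can be written in a few lines. The following NumPy sketch is purely illustrative; the default values of α_1 and α_2 follow the figure caption and are not a recommendation.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def lrelu(x, alpha1=0.1):
        # leaky variant: a small non-zero slope for negative inputs
        return np.where(x > 0, x, alpha1 * x)

    def elu(x, alpha2=1.0):
        # exponential variant: smooth saturation towards -alpha2 for negative inputs
        return np.where(x > 0, x, alpha2 * (np.exp(x) - 1.0))

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(x), lrelu(x), elu(x), sep="\n")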

2.5 Training Convolutional Neural Networks

As we discussed in Section 2.4, the architecture of CNNs allows them to perform tasks on high dimensional inputs. Despite scaling favorably compared to fully connected ANNs, there are still a lot of weights to learn.

2.5.1 Objective Functions

Once the architecture of an ANN, and therefore of a CNN, is fixed, its output Y is a deterministic function of the input X and the weights W, as can be seen in Equation 2.5. We have a large degree of freedom in shaping this output function due to the high dimensionality of W. A notable recent example of this high dimensionality of W is VGGNet [14], which won the ImageNet ILSVRC-2014 contest in the localization task and had a total of 140 million weights.

Y = F(X, W)    (2.5)

The act of training a CNN is a (mainly) supervised learning problem. It is the task of adjusting the output function to match what is reflected in the training data. The first step in performing this task is quantifying how well the current output matches the desired output. We call this quantifier the Objective Function, but it is equivalently referred to as a Loss Function or a Cost Function. Once we have found such a fitting measure we can rephrase the training of a neural network as a mathematical optimisation problem: it is our objective to adjust the weights such that they minimize the Loss/Cost/Objective Function.

An objective function is the average of costs c over the training data. Each cost c depends on the training sample and the weights of the network, and returns some measure of how well the network behaves under these weights according to the training sample it is given. A general objective function is given in Equation 2.6, where (X_n, Y_n) is a training sample and F(X_n, W) is the output of a network given input X_n and weights W.

C(W, X, Y) = \frac{1}{N} \sum_{n=0}^{N} c(Y_n, F(X_n, W))    (2.6)

Recent work suggests that it can be beneficial to pre-train the network in an unsupervised manner [15] [16]. The benefits are numerous, especially when labelled data is scarce [17]. Despite having an important role in DNNs, unsupervised pre-training is not mentioned in the rest of this thesis, as CNNs suffer less from the problems solved by unsupervised pre-training than general DNNs do [1] [13].

In what follows I will discuss two different objective functions, Mean Squared Error (MSE) and Cross Entropy. These are two of the most commonly used objective functions in machine learning and statistics in general. They have also inspired a set of other objective functions, each solving one of the downsides of their respective original, often at the cost of computational efficiency. MSE is used in regression problems, while cross entropy is used in classification tasks.

Mean Squared Error

The MSE function seen in Equation 2.7 is widely used in statistics and machine learning to describe the error between a prediction (or estimation in statistics) and the ground truth. Its prevalence is partly due to its intuitive interpretation but foremost due to its relationship with the Maximum Likelihood Estimator (MLE). It can be shown that under reasonable conditions the value of W that minimizes Equation 2.7 is equal to the MLE of W given the function F and training data (Y_n, X_n) [9].

C(W, X, Y) = \frac{1}{N} \sum_{n=0}^{N} \big(Y_n - F(X_n, W)\big)^2    (2.7)

Equation 2.7 is often multiplied by a constant 1/2 as a mathematical convenience to ease differentiating with respect to W. The reason why we want this differentiation to be convenient will become clear in Section 2.5.2.

Cross Entropy

MSE emphasizes the distance between a prediction by the network F(X_n, W) and the ground truth Y_n. This implies there is some innate ordering in our predictions and truths, which is trivially so for regression problems (a 4 is closer to 3 than to 5), but this is no longer the case for classification. When the ground truth is label 3, a classification as label 5 is not twice as bad as label 4.

C(W, X, Y) = -\frac{1}{N} \sum_{n=0}^{N} \log\big(P(F(X_n, W) = Y_n)\big) = -\frac{1}{N} \sum_{n=0}^{N} \log\big(P(Y_n \mid X_n, W)\big)    (2.8)

We've seen in Section 2.2 that ANNs used for classification output a probability density that resembles the confidence in a given label. We would aim to optimize the network such that the confidence in the truth (Y_n) is as close to 1 as possible (and therefore, since the output is a probability density, the confidence in other labels would drop to 0). To improve numerical stability, rather than maximizing the sum of the network's confidences in the ground truth, we minimize the sum of the negative logarithms of those confidences, as in Equation 2.8.
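A small numerical illustration of Equations 2.7 and 2.8, assuming the network outputs are already available as arrays; this is a NumPy sketch for intuition, not code used in this thesis.

    import numpy as np

    def mse(y_true, y_pred):
        # Equation 2.7: mean squared error over the training samples
        return np.mean((y_true - y_pred) ** 2)

    def cross_entropy(probs, targets, eps=1e-12):
        # Equation 2.8: mean negative log-confidence in the ground-truth labels
        return -np.mean(np.log(probs[np.arange(len(targets)), targets] + eps))

    # regression example
    print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 2.5])))

    # classification example: two samples, three labels
    probs = np.array([[0.7, 0.2, 0.1],
                      [0.2, 0.5, 0.3]])
    print(cross_entropy(probs, np.array([0, 1])))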

2.5.2 Gradient Descent

Gradient descent is a method to reach the nearest local minimum of a function given a starting point X_0. Assuming X is a vector of inputs (x_1, x_2, ..., x_n), the method is based on the notion that it can be proven, under reasonable assumptions, that at any point X in which a function f is differentiable the following statements hold for some small λ > 0:

X_{i+1} = X_i - \lambda \nabla_X f(X_i)    (2.9)
f(X_{i+1}) \le f(X_i)    (2.10)

Here \nabla_X f is the gradient of the function, defined as the vector whose components are the partial derivatives of f with respect to the components of X:

\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)    (2.11)

Applying Equation 2.9 to Equation 2.6 (which is the function we ultimately aim to minimize) gives us

W_{i+1} = W_i - \lambda \nabla_W C(W)    (2.12)
W_{i+1} = W_i - \lambda \nabla_W \frac{1}{N} \sum_{n=0}^{N} c(Y_n, F(X_n, W))    (2.13)
W_{i+1} = W_i - \lambda \frac{1}{N} \sum_{n=0}^{N} \nabla_W c(Y_n, F(X_n, W))    (2.14)

If we consider subsequent points X_i from Equation 2.9 until f(X_{i+1}) = f(X_i), we have either reached a local minimum, i.e. a point where no small change of X_i in any of the n directions reaches a lower function value than X_i itself, or a saddle point, i.e. a point in which \nabla f = 0 that is not a local minimum.

Figure 2.12: Saddle point

At first glance the notion of a local minimum is a crippling counterargument for using gradient descent to find a minimum. It turns out we do not have to worry about getting stuck in such a local minimum, as the prevalence of local minima decays exponentially with respect to the dimensionality of the problem [18]. A bigger problem, however, is the high prevalence of saddle points. Due to the prevalence of these saddle points or pseudo-saddle points (where \nabla f \approx 0), the cost function shows a lot of regions of low curvature.
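The update of Equation 2.14 is a few lines of code once the per-sample gradients are available. The sketch below applies it to a toy quadratic cost, so that the gradient can be written in closed form; it is illustrative only, and it omits the momentum term introduced later in this section.

    import numpy as np

    # Toy setting (assumed for illustration): cost c(Y_n, F(X_n, W)) = (Y_n - X_n . W)^2,
    # so the per-sample gradient with respect to W is -2 * (Y_n - X_n . W) * X_n.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    true_W = np.array([1.0, -2.0, 0.5])
    Y = X @ true_W

    W = np.zeros(3)
    lam = 0.1                                   # learning rate (lambda in Equation 2.9)
    for step in range(200):
        residual = Y - X @ W                    # Y_n - F(X_n, W) for every sample
        grad = -2.0 * X.T @ residual / len(X)   # average of the per-sample gradients (Eq. 2.14)
        W = W - lam * grad                      # the gradient descent update
    print(np.round(W, 3))                       # approaches [ 1. -2.  0.5]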

Due to this issue, DNNs (and therefore CNNs) were long thought to be nearly impossible to train with gradient descent, as the weights would often get stuck in non-optimal regions of very low curvature. Techniques involving higher-order derivatives were introduced so that training can take a big step in low-curvature regions and small steps in high-curvature regions. These methods have been shown to obtain results superior to those trained with plain gradient descent.

In one dimension we can develop such a second-order method, called Newton's Method, using the second-order Taylor expansion of a function:

f(x + \Delta x) = f(x) + \Delta x f'(x) + \frac{\Delta x^2}{2} f''(x) + \eta    (2.15)
with \eta \in O(\Delta x^3 / 6)    (2.16)

Provided ∆x is small we can assume η to be negligible, as it is proportional to ∆x^3, such that Equation 2.15 is a strong approximation of f in the neighborhood of x. With this in mind we can minimize f by finding the ∆x for which f(x + ∆x) is minimal:

\frac{d}{d\Delta x} f(x + \Delta x) = \frac{d}{d\Delta x}\left( f(x) + \Delta x f'(x) + \frac{\Delta x^2}{2} f''(x) \right) = 0
f'(x) + \Delta x f''(x) = 0
\Delta x = -\frac{f'(x)}{f''(x)}

x_{i+1} = x_i + \Delta x_i = x_i - \frac{f'(x_i)}{f''(x_i)}    (2.17)

We then recursively follow Equation 2.17 to reach a minimum of f. The benefit of this technique over gradient descent is clear in the denominator of ∆x_i. In a low-curvature region, where f'' is small as f' changes only slightly, we take a big step in the direction of -f' to avoid getting stuck. Similarly, when f'' is large, in an area of high curvature, the step we take will be small to avoid stepping over a minimum.
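A minimal sketch of the one-dimensional Newton iteration of Equation 2.17 (the test function f(x) = (x - 3)^4 is an assumption chosen only to illustrate the recursion):

```python
def newton_minimize(f_prime, f_double_prime, x0, steps=20):
    """Iterate x_{i+1} = x_i - f'(x_i) / f''(x_i) (Equation 2.17)."""
    x = x0
    for _ in range(steps):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# Toy example: f(x) = (x - 3)^4, so f'(x) = 4(x - 3)^3 and f''(x) = 12(x - 3)^2.
print(newton_minimize(lambda x: 4 * (x - 3) ** 3,
                      lambda x: 12 * (x - 3) ** 2,
                      x0=0.0))  # approaches the minimum at x = 3
```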

The generalisation towards multiple dimensions follows by replacing the first derivative with the gradient and the second derivative with the Hessian, the matrix of second-order partial derivatives:

f'(x) \rightarrow \nabla f(X), \qquad f''(x) \rightarrow H(f)(X)

Despite being a clear solution to stuck gradients, these second-order techniques require the computation of the Hessian matrix H(f), or equivalently all second-order derivatives. As of today we do not know how to calculate that Hessian in a large neural network other than by reverting to numerical estimations. A proposed solution is the Hessian-free second-order optimization developed by Martens et al. [19], but since state-of-the-art networks report no real issues when it comes to training with first-order methods [1][20], gradient descent is still preferred when carefully applied [21].

Momentum

From Equation 2.9 we see that the update rule in gradient descent is instantaneous, i.e. the change in X at one step depends only on the gradient at that point at that step. We can significantly improve the rate of convergence [22] by introducing a velocity vector in Equation 2.9 such that:

X_{i+1} = X_i + V_{i+1}    (2.18)
with V_{i+1} = \mu V_i - \lambda \nabla f(X_i)    (2.19)

In Equation 2.18 we refer to µ as the momentum and to λ as the learning rate. The benefits of including a momentum become clear if we again consider regions of low curvature. By definition of such regions, the gradient changes only slightly across steps. In traditional gradient descent this leads to a very slow learning process, or even to the algorithm getting stuck in such a region. With the inclusion of a momentum, however, the update accumulates those gradient directions that do not change across steps, leading to a growing step size in such regions, while the step stays small in directions where the gradient fluctuates.

From the previous discussions we can take away that if the objective function is differentiable with respect to the weights of the network, we can use gradient descent to minimize this function. We do still have to consider how we can efficiently find that gradient.
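A minimal sketch of the momentum update of Equations 2.18-2.19 (illustrative only; the elongated quadratic toy problem is an assumption, not one of the networks discussed here):

```python
import numpy as np

def momentum_descent(grad, x0, lr=0.01, mu=0.9, steps=200):
    """Gradient descent with a velocity vector: V <- mu*V - lr*grad(X); X <- X + V."""
    x, v = x0, np.zeros_like(x0)
    for _ in range(steps):
        v = mu * v - lr * grad(x)
        x = x + v
    return x

# Toy quadratic bowl: steep in x[0], very flat (low curvature) in x[1].
grad = lambda x: np.array([10.0 * x[0], 0.1 * x[1]])
print(momentum_descent(grad, np.array([1.0, 1.0])))  # both coordinates head towards 0
```

With plain gradient descent at the same learning rate, the flat coordinate would barely move over the same number of steps; the accumulated velocity is what speeds it up.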

Backpropagation

Finding the gradient of an objective function with respect to the weights proved to be a difficult issue, and gradient descent was largely unexplored in neural networks until the famous 1986 paper by Rumelhart and Hinton [22]. In this paper the idea of propagating errors back through the network, dubbed backpropagation or backprop, is explored and shown to be a fast algorithm for computing the necessary gradient.

Backpropagation is based on the chain rule, which states that when g is differentiable in x and f is differentiable in g(x), f(g(x)) is differentiable and its derivative can be written as

\frac{df}{dx} = \frac{df}{du} \frac{du}{dx}    (2.20)
with u = g(x)    (2.21)

Next, let us reconsider Equation 2.5: rather than looking at the input-output behaviour of the network as a whole, we can regard it as a subsequent application of functions to the input. If we consider f_l to be the function applied in layer l, and F to be the function of the complete network with L layers, we can write

F = f_L \circ f_{L-1} \circ \ldots \circ f_2 \circ f_1

or, with X the input to the network, W all the weights of the network and W_l the weights of layer l,

F(X, W) = f_L(f_{L-1}(\ldots f_2(f_1(X, W_1), W_2) \ldots, W_{L-1}), W_L)

Remember that we are not looking for the gradient of the network, but rather the gradient of the objective function C. This C is one of the objective functions described earlier applied to the output F(X, W), and is therefore subject to the chain rule once more. The resulting derivative of the objective function C with respect to a weight w_i, after applying Equation 2.20, looks like:

\frac{dC}{dw_i} = \frac{dC}{dF} \frac{dF}{df_L} \frac{df_L}{df_{L-1}} \cdots \frac{df_2}{df_1} \frac{df_1}{dw_i}    (2.22)

To see how this equation helps us find the gradient, consider what the factors in Equation 2.22 represent. dC/dF is the rate of change of the objective with respect to the output of the network, a quantity that can be calculated from the objective function C and the output of the network. dF/df_L is equal to 1, as f_L is the last layer and its output is the network output. The factors df_n/df_{n-1} are the rate of change of a layer with respect to the output of the previous layer. Once the output of each layer is known from a forward pass, we can therefore calculate these factors and work our way through the equation, multiplying all the factors that can be calculated. The quantity df_1/dw_i can be calculated as w_i is part of the weights of layer 1. If it were not, and therefore part of some layer n > 1, Equation 2.22 would truncate at \cdots \frac{df_{n+1}}{df_n} \frac{df_n}{dw_i}, leaving us with an equivalent case.

All the factors of Equation 2.22 can therefore be calculated locally in each layer (a layer only needs to know its inputs, which are the outputs of the previous layer, and its own weights). This is where backpropagation gets its name: we do a forward pass through the network such that all the outputs and the eventual objective score are known, after which we make a backward pass, calculating the derivatives along the way. To simplify notation, the process of training a network using a gradient-based optimization method where the gradient is calculated with backpropagation is often, pars pro toto, referred to simply as backprop.
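A self-contained NumPy sketch of one forward and one backward pass for a tiny two-layer network (my own illustration of Equation 2.22, not code from the experiments; the sigmoid activation, squared-error cost and random weights are assumptions made only for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny two-layer network: h = sigmoid(W1 @ x), y = W2 @ h, C = 0.5 * ||y - t||^2.
rng = np.random.default_rng(0)
x, t = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

# Forward pass: store every intermediate output.
z1 = W1 @ x
h = sigmoid(z1)
y = W2 @ h
cost = 0.5 * np.sum((y - t) ** 2)

# Backward pass: apply the chain rule (Equation 2.22) layer by layer.
dC_dy = y - t                      # dC/dF
dC_dW2 = np.outer(dC_dy, h)        # gradient for the last layer's weights
dC_dh = W2.T @ dC_dy               # propagate the error to the hidden layer
dC_dz1 = dC_dh * h * (1 - h)       # multiply by the activation derivative da/di
dC_dW1 = np.outer(dC_dz1, x)       # gradient for the first layer's weights

# Check one entry against a numerical estimate of dC/dW1[0, 0].
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (0.5 * np.sum((W2 @ sigmoid(W1p @ x) - t) ** 2) - cost) / eps
print(dC_dW1[0, 0], num)  # the two values should agree closely
```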

Vanishing Gradient

In order to describe the vanishing gradient problem we take a deeper look at the factors df_n/df_{n-1} from Equation 2.22. Recall from Equation 2.2 that the output of a layer is nothing more than a linear combination of its inputs with its weights, passed through an activation function a. For ease of notation we introduce a new function i that denotes this linear combination of the inputs with the weights. Differentiating df_n/df_{n-1} can then be done with the chain rule:

\frac{df_n}{df_{n-1}} = \frac{da}{di} \frac{di}{df_{n-1}}    (2.23)

Substituting Equation 2.23 into Equation 2.22 pinpoints the vanishing gradient problem. The factor da/di (the derivative of the activation function with respect to the linear combination that serves as its input) suddenly appears once for every layer. Equation 2.22 is therefore proportional to the factor da/di raised to some power K, where K is the depth of the layer in which w_i plays a role:

\frac{dC}{dw_i} \propto \left(\frac{da}{di}\right)^K

When our network has many layers, the weights in the first layers have a large K. For da/di < 1 this means the gradient for weights in the first layers decays exponentially, whereas da/di > 1 would lead to enormous values for the gradient. The first case is dubbed the vanishing gradient problem and is prevalent when the sigmoid or tanh activation functions from Figure 2.9 are used, as their derivative is close to 0 outside a small interval around 0. This also explains why the ReLU alleviated (not solved: even with a constant da/di, Equation 2.22 is still a product of many possibly small values) the vanishing gradient problem, as its derivative is a constant 1.

This fundamental issue with deep networks makes the results achieved on ImageNet by deep networks even more remarkable. Despite a clear mathematical reason why backprop should not work on deep networks, it does not appear to be an issue in practice. It is however important to note that CNNs are highly structured in the lower layers, and that the weights of the convolutions are strongly constrained. As will be discussed in Section 2.7, even randomly initialised convolution weights can be used to perform strong classification, which leads us to believe that the training needed in the lower layers is minimal, provided careful initialization of these layers [21]. Research now considers the vanishing gradient problem a fundamental limit, but not a dealbreaker. Careful initialization, greedy layer-wise pretraining and highly structured or constrained networks (which CNNs are) can make it so that the supervised learning required at the lower layers is limited.
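A small numerical illustration of why the factor (da/di)^K shrinks so quickly for saturating activations (my own sketch with assumed random pre-activations, not a measurement on a trained network):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

# The backpropagated factor is roughly proportional to (da/di)^K.
# For the sigmoid, da/di <= 0.25, so the product shrinks quickly with depth K.
rng = np.random.default_rng(1)
pre_activations = rng.normal(size=50)          # one assumed pre-activation per layer
for depth in (1, 5, 10, 25, 50):
    print(depth, np.prod(sigmoid_prime(pre_activations[:depth])))
```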

2.6 Getting more out of Convolutional Neural Networks

Stochastic gradient descent

Looking at the gradient of the general cost function in Equation 2.12, we notice that for large N it becomes quite a nuisance to calculate the cost or its gradient, as it involves summing over all N training samples. Stochastic gradient descent avoids this issue by estimating the gradient based on a subsample S of the whole training set:

\hat{\nabla}_W C(W) = \frac{1}{|S|} \sum_{n \in S} \nabla_W c(Y_n, F(X_n, W))    (2.24)

Using the stochastic approach involves selecting a few new parameters, such as the size of the subset S and a learning rate (typically a lot smaller than with the non-stochastic variant). Some insights have been collected [23], but practically you can begin using stochastic gradient descent with only a few guidelines:

- Use stochastic gradient descent when time or memory are a bottleneck during training.
- Employ more iterations (N/|S|) with a smaller learning rate.
- Use a random permutation of the training set such that each subset S is independent and representative of the full training set.
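A minimal sketch of this procedure (an illustration of Equation 2.24 with an assumed toy linear-regression problem, not the training loop used for the experiments):

```python
import numpy as np

def sgd(grad_on_batch, W0, data, lr=0.05, batch_size=32, epochs=10, seed=0):
    """Minibatch SGD: estimate the gradient on a random subset S each step (Equation 2.24)."""
    rng = np.random.default_rng(seed)
    W, N = W0, len(data)
    for _ in range(epochs):
        order = rng.permutation(N)                # random permutation of the training set
        for start in range(0, N, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            W = W - lr * grad_on_batch(W, batch)  # one update per subset S
    return W

# Toy usage: fit W to minimize the squared error of a linear model y = W . x.
rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
data = [(x, x @ true_w) for x in rng.normal(size=(500, 2))]

def grad_on_batch(W, batch):
    return np.mean([2 * (x @ W - y) * x for x, y in batch], axis=0)

print(sgd(grad_on_batch, np.zeros(2), data))  # approaches [2, -1]
```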

Data Augmentation

Generally speaking, you can improve the results of any machine learning problem by obtaining more data. It is, however, not often easy or cheap to obtain such data. Consider a case where we would like to classify EEGs of a patient's brain to determine whether the patient is likely to suffer an epileptic attack. For a model to learn this task it needs samples of an EEG recorded when an attack is likely to happen, which is cumbersome to obtain. So despite the high value of good data, it is not straightforward (or profitable) to obtain arbitrary amounts of it. Synthetically generating more data based on the existing data is dubbed data augmentation. This process is highly dependent on the task at hand and will therefore not be discussed in general, despite its relative importance.

2.7 Regularizing Convolutional Neural Networks

Highly complex models are capable of performing highly complex tasks. It can therefore be tempting to make your model infinitely complex (or at least as complex as your hardware will allow) such that it trivially has the necessary complexity to complete the task at hand. These infinitely complex models will be able to achieve perfect scores on the training data for any task, but will not behave accordingly on data they have not been trained on. Perhaps the clearest example is a model that is so complex it can keep every datapoint used in training in its memory: it can perfectly perform tasks on these points as it has the answer memorized. It should however be noted that the eventual goal of a model is always to perform well on unseen data. This possibly large gap between performance on training data and performance on unseen data is called overfitting: because a model is overly eager to fit the training data well, it tends to behave unexpectedly on unseen data. It would be beyond the scope of this thesis to discuss all the techniques used to prevent overfitting, but two cases are discussed explicitly due to their importance in CNNs.

The Convolutional Layer

As weight sharing is a strong constraint on the complexity of convolutional layers, they appear to be strongly regularized.

Keeping in mind that the first layers of a CNN are considered to be the feature extractors, this strong regularization should mean that trained layers perform well regardless of the task at hand. This is shown by the good results obtained by using the convolutional layers of a trained network known as OverFeat as an off-the-shelf feature extractor for image classification. OverFeat was trained for the ImageNet 2013 challenge using millions of datapoints and still yields highly competitive results when used as a feature extractor for a classification task that has nothing to do with ImageNet [4]. Sermanet and LeCun take it one step further by using randomly initialized convolutional layers and training a classifier to work with these random features. They did so on a dataset of traffic signs, alongside a fully trained CNN. Not surprisingly, the fully trained CNN performed best with an accuracy of 99.17%. What is surprising, however, is that a classifier trained on features generated by the random convolutional layers reached a competitive 97.33% [3].

This remarkable generality of the convolutional layers can be interpreted in two ways. For one, regularizing a CNN is not harder than regularizing a general feed-forward neural network of equivalent depth and number of parameters. Secondly, the feature extraction layers of a CNN can be used as an off-the-shelf feature extractor for an arbitrary problem. Provided the full weights of the pretrained layers are available, these weights can also be used as initialization weights which can then be fine-tuned for a specific problem, drastically improving training time compared to starting from random weights.

Dropout

The overfitting behavior of large neural networks (not just CNNs) has been found to reduce greatly when random neurons are omitted during the training phase [24]. For every training step the output of each neuron is randomly kept with a probability p and set to 0 otherwise. We can implement this by introducing an extra layer between all layers of the neural network that multiplies some outputs of the previous layer by 0. Another, perhaps more intuitive, way is to redefine the activation of every neuron as

B \sim \text{Bernoulli}(p)    (2.25)
\tilde{a}(W \cdot x) = B \cdot a(W \cdot x)    (2.26)

After training, dropout is no longer applied. At that point the number of neurons that is active on average increases by a factor 1/p. Therefore the weights of every neuron in the network should be multiplied by a factor p, to account for the increased average input.
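A minimal sketch of the dropout mask of Equations 2.25-2.26 (my own illustration; the keep probability and toy activations are assumptions, and the test-time scaling is applied to the activations here, which is equivalent to scaling the weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p, train=True):
    """Dropout as in Equations 2.25-2.26: keep each activation with probability p.

    At test time no units are dropped; the activations are scaled by p to
    compensate for the larger average input.
    """
    if train:
        mask = rng.binomial(1, p, size=activations.shape)  # B ~ Bernoulli(p) per neuron
        return activations * mask
    return activations * p

h = np.ones(10)
print(dropout_forward(h, p=0.8))               # a random subset of units zeroed out
print(dropout_forward(h, p=0.8, train=False))  # all units kept, scaled by p
```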

Dropout has been found to consistently improve the generalization of neural networks [24]. To understand why dropout improves performance, and how dropping random connections results in better generalisation, we informally consider the effect of applying dropout in two ways.

Firstly, randomly removing (or nullifying) connections from a larger neural network yields a new neural network that is a random subset of the original, as every neuron of the new network is also present in the original. When dropout is applied after every training step, the next training step trains a different one of these network subsets. After a number of these training steps we combine all these trained subsets into the original network, appropriately weighted (with a factor p). In general, a set of independent strong classifiers can be made to perform better than the individual classifiers when combined properly [9]. Dropout can therefore be regarded as an efficient way to combine an enormous set of strong classifiers. These classifiers might not be independent, but empirically and intuitively a combination of expert opinions outperforms a single expert. This property is analogous to that of random forests.

Secondly, when applying dropout a single neuron in the network cannot rely solely on a complex combination of its inputs, as those inputs are no longer guaranteed to be there for every iteration of the training process. This is best explained by the following example. Imagine that, for a given dataset of images, all the images relating to a guitar have bright green pixels in the top left due to some camera error. Even though these pixels are not a meaningful feature (we know it is a camera error, the network does not), they can be used to identify images of guitars in the dataset. Applying dropout to such a network makes it so that none of the neurons can rely on these pixels being present in the pictures of guitars, and therefore every neuron is forced to learn more reliable features rather than a (possibly complex) unreliable one.


Chapter 3

Objective Function: Design and Adaptation

The research within computer vision traditionally considers classification errors in a binary manner: a classification is correct or it is not. As discussed in Section 2.2, a network tasked with classifying an input outputs a probability density that represents the chance that the input belongs to each respective class. These probabilities are often considered the network's confidence in each label. In this view the performance of the network is based solely on its confidence in the correct classification label: if a given input belongs to label A, the overall performance is based solely on how confident the network is in this label A. This binary approach to errors means research generally does not care which error occurs, only that an error occurs. When a dataset is structured in some hierarchy, classifying an oak as a pine tree is of course wrong, but it would hierarchically be a better guess than classifying that oak as a sports car.

In what follows, several approaches are discussed that introduce some form of hierarchy on the data and/or the errors in order to reduce the severity of an error. At the end of this chapter several open questions will be formulated, which will be researched in the following chapters.

3.1 Discriminative Categories

When the classes of a dataset form a hierarchy, it can be exploited for more favorable results as opposed to a flat class space. Gao et al. [25] aimed to exploit the hierarchy by building a tree in which each node is a binary SVM. Their main contribution is the reduced computational complexity of

structuring a multiclass classifier this way, as opposed to the more general one-vs-all or one-vs-one multiclass classifiers. Next to the improvement in computational complexity, they also report a significant increase in classification accuracy. Once an image is passed through a binary SVM it is effectively bound to the left or right side of its subtree. This provides a rather intuitive approach to multiclass classification, where an image is first classified coarsely and then classified using more specialist classifiers to perform the finer classification.

Another benefit, not discussed by Gao et al., is that this discriminative classification tree allows for an easy opt-out of a classification. If at some point a binary classifier in one of the nodes can no longer confidently classify an image, we can allow the global classifier to use this coarser category as a final prediction, rather than a fine-grained category that has a high probability of being wrong (as the classifier is not confident). This idea was further developed in the research discussed in the following section.

3.2 Hedging Your Bets

A more recent approach towards introducing this new look on errors in computer vision was taken by Deng et al. [26]. Their work defines a trade-off between the accuracy and the specificity of a prediction. It is important to note that these metrics relate to a prediction, not to the network. Whereas both accuracy and specificity are well-defined metrics to describe the performance of a network, in this case they are used to describe respectively the correctness and the information gain of a prediction. The accuracy is either 1 (the prediction is correct) or 0. The specificity is a measure of the information in a prediction. If we label a tree as either a tree or an oak, both would be correct, but the latter label holds more information and is therefore more specific. Deng et al. provide an algorithm to maximize the information given by a prediction (i.e. its specificity) while guaranteeing an arbitrarily high average accuracy.

The specificity-accuracy trade-off is best illustrated by the following example. Consider a classifier trained on a dataset of animals to detect the different species of animals. A classifier that confidently detects everything to be an animal achieves perfect accuracy, as it is never wrong, but the extra information given by the prediction is naught, as the dataset already implies that everything is an animal. On the other hand, a classifier that can predict everything up to the subfamily of an animal's species provides us with

maximum information, but such a classifier is bound to be less than perfect (the state-of-the-art top-5 error on ImageNet as of this writing is 6.7% [27]). Deng et al. provide a way to automatically select the correct level of specificity. A classifier will decide on the level within the hierarchy that has both the required accuracy and the highest specificity. A real comparison of how well it performs compared to the state of the art is rather difficult, as no comparable benchmark has been set.

I argue that despite this technique of opting out of a specific classification sounding promising, it is not generally applicable. Often we will still want our classifier to make a best guess, even if there is no guarantee on the correctness of this guess. A more interesting solution to the problem would be a classifier that is as specific as possible and can therefore make mistakes, but whose mistakes are still as semantically close to the ground truth as possible. Consider the case where a classifier is presented with an image of a tree. If the classifier is not sure what type of tree it is given, one trained with the hedging-your-bets technique would label the image as just a tree. If that image is actually an oak, labelling it as a pine tree would contain just as much information as labelling it as a tree. The problem remains to make sure the classifier does pick one of these close guesses, i.e. some kind of tree, rather than a completely different label that by coincidence shares a lot of features with the image of the oak.

3.3 Semantic cost

The work of Zhao et al. [28] studies the results of introducing a semantic relatedness of classes in the soft-max likelihood. They consider the benefits of hierarchical approaches to be two-fold. On the one hand it aids in reducing the severity of the errors, as sketched in the introduction of this chapter, but it can also aid in reducing the effective dimensionality of the features. It is shown that classes with a strong semantic relatedness are able to share features, reducing the total number of features needed to distinguish between all possible classes. The conditional likelihood is traditionally given by the soft-max:

P(y_i | x_i, W) = \frac{\exp(W_{y_i}^T x_i)}{\sum_k \exp(W_k^T x_i)}    (3.1)

We now aim to introduce a measure S where S_{i,j} measures the semantic relatedness of classes i and j, which leads us to a new likelihood \hat{P} such that:

\hat{P}(y_i | x_i, W) = \frac{1}{Z} \sum_{r=1}^{M} S_{y_i, r} P(r | x_i, W)    (3.2)

where 1/Z is a normalisation constant such that \hat{P} is a probability distribution and M is the total number of classes. Computing this 1/Z and simplifying the expression defines the augmented soft-max as:

\hat{P}(y_i | x_i, W) = \frac{\sum_{r=1}^{M} S_{y_i, r} \exp(W_r^T x_i)}{\sum_{k=1}^{M} \sum_{r=1}^{M} S_{k, r} \exp(W_r^T x_i)}    (3.3)

In order to define the measure for semantic relatedness S, Zhao et al. first define a distance matrix D whose elements are given by:

D_{i,j} = \frac{\text{length}(\text{path}(i) \cap \text{path}(j))}{\max(\text{length}(\text{path}(i)), \text{length}(\text{path}(j)))}    (3.4)

Here path(i) is the path from the root node (the base class) down the hierarchy to class i, while p_1 \cap p_2 denotes the set of classes that are part of both paths p_1 and p_2. The matrix S is then defined as:

S = \exp(-\kappa (1 - D))    (3.5)

where κ is a hyper-parameter that governs the decay of the relatedness. The work of Zhao et al. provided a useful framework on which to base what follows.

3.4 Semantic Cross Entropy

The previous work inspired the following research. I will introduce a hierarchy-aware objective function for CNNs and evaluate the performance of networks trained with this new objective function against set benchmarks. The semantic cross entropy is inspired heavily by the work of Zhao et al. [28] and the cross entropy objective function from Section 2.5. The proposed objective is given in Equation 3.6, where L is the set of M possible labels.

C(W, X, Y) = -\frac{1}{N} \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} S_{Y_n, L_m} \log(P(L_m | X_n, W))    (3.6)

The values S_{i,j} describe the semantic relatedness of labels i and j. The inner sum of Equation 3.6 is therefore a sum over all possible labels, where the (logarithm of the) confidence in each label is weighted by that label's relatedness to the ground truth. If every concept is related only to itself, S is the identity (eye) matrix, as S_{i,j} = 0 for i ≠ j. In that case the inner sum is only non-zero when L_m = Y_n, which reverts the semantic cross entropy back to the general cross entropy. Perhaps a clearer way to think about the proposed function is that the value of S_{Y_n, L_m} allows a network to be somewhat confident in a label L_m that is not the ground truth Y_n, provided that label is semantically related to the ground truth.

3.5 Semantic relatedness matrix

Continuing on the work described in Section 3.3, we define a fitting semantic similarity measure S to be used in the proposed semantic cross entropy. We assume the labels of the dataset over which we will train and evaluate our network are structured in a tree. The leaf nodes are the labels we want the network to learn, while the intermediate nodes are (possibly abstract) semantic parents of the lower nodes. In such a dataset we can define a distance between nodes i and j:

D_{i,j} = \frac{\text{length}(\text{path}(i) \cap \text{path}(j))}{\max(\text{length}(\text{path}(i)), \text{length}(\text{path}(j)))}    (3.7)

defined exactly as in Equation 3.4, and

S_{i,j} = \exp(-\kappa (1 - D_{i,j}))    (3.8)

As a final step the matrix S is normalized such that its rows (or equivalently its columns, due to symmetry) sum to 1. This normalization ensures that none of the labels is more important than any other when used in the semantic cross entropy objective. The impact of κ will be discussed in Section 5.1.
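A minimal sketch of this construction (the three-label toy hierarchy and the value κ = 2 are assumptions chosen only for illustration, not the hierarchy or hyper-parameter used in the experiments):

```python
import numpy as np

# Toy hierarchy: root -> dog -> {rottweiler, doberman}, root -> cat.
# path(i) lists the nodes from the root down to leaf label i.
paths = {
    "rottweiler": ["root", "dog", "rottweiler"],
    "doberman":   ["root", "dog", "doberman"],
    "cat":        ["root", "cat"],
}
labels = list(paths)
kappa = 2.0  # assumed value of the decay hyper-parameter

def relatedness_matrix(paths, labels, kappa):
    """Build D (Equation 3.7) and the row-normalized S (Equation 3.8)."""
    M = len(labels)
    D = np.zeros((M, M))
    for a, la in enumerate(labels):
        for b, lb in enumerate(labels):
            shared = len(set(paths[la]) & set(paths[lb]))
            D[a, b] = shared / max(len(paths[la]), len(paths[lb]))
    S = np.exp(-kappa * (1.0 - D))
    return S / S.sum(axis=1, keepdims=True)   # rows sum to 1

def semantic_cross_entropy(S, probs, targets):
    """Equation 3.6 for a batch: weigh log-confidences by relatedness to the truth."""
    return -np.mean(np.sum(S[targets] * np.log(probs), axis=1))

S = relatedness_matrix(paths, labels, kappa)
probs = np.array([[0.2, 0.6, 0.2]])            # one prediction over the three labels
print(S)
print(semantic_cross_entropy(S, probs, targets=np.array([0])))  # ground truth: rottweiler
```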

3.6 Example use

Consider a simple example where a network is asked to label images as either a rottweiler, a doberman or a cat. An S matrix for these three labels would assign a high relatedness between the two dog breeds and a low relatedness between either breed and the cat, with each label maximally related to itself.

We will ask two fictional networks to classify the image of a rottweiler depicted in Figure 3.1. The resulting confidences are listed in Table 3.1.

Figure 3.1: Image of a rottweiler

              rottweiler   doberman   cat
network 1        0.2          0.6     0.2
network 2        0.3          0.1     0.6

Table 3.1: Prediction results on the rottweiler image

The first network seems to think the presented image is that of a doberman. We know this to be wrong, but all in all the picture might not be clear enough to confidently distinguish between the breeds of dog. We can however be quite sure it is not a cat, and therefore network 2 can be considered blatantly wrong.

              semantic cross entropy                     non-semantic cross entropy
network 1     -(0.7 log(0.2) + ... log(0.6)) = 0.43      -log(0.2) = 0.70
network 2     -(0.7 log(0.3) + ... log(0.1)) = 0.66      -log(0.3) = 0.52

Table 3.2: Semantic and non-semantic cross entropy for both networks

Table 3.2 lists the results of evaluating both the non-semantic (traditional) and the semantic cross entropy on the outputs of both networks. The non-semantic cross entropy prefers network 2, while the semantic variant prefers network 1.
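To make the comparison reproducible, the following sketch uses the confidences of Table 3.1 together with an assumed relatedness row for the rottweiler of (0.7, 0.3, 0.0). These S values are illustrative, not the exact numbers behind Table 3.2, but the qualitative conclusion is the same: the semantic objective prefers network 1, the traditional one prefers network 2.

```python
import numpy as np

# Confidences of the two fictional networks over (rottweiler, doberman, cat).
net1 = np.array([0.2, 0.6, 0.2])
net2 = np.array([0.3, 0.1, 0.6])

# Assumed relatedness of each label to the ground truth "rottweiler".
s_rottweiler = np.array([0.7, 0.3, 0.0])

def cross_entropy(p):            # traditional: only the ground-truth confidence counts
    return -np.log(p[0])

def semantic_cross_entropy(p):   # Equation 3.6 for a single sample
    return -np.sum(s_rottweiler * np.log(p + 1e-12))  # epsilon guards against log(0)

for name, p in (("network 1", net1), ("network 2", net2)):
    print(name, cross_entropy(p), semantic_cross_entropy(p))
# The traditional cross entropy is lower for network 2,
# while the semantic cross entropy is lower for network 1.
```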

Now imagine that the ground-truth label attached to Figure 3.1 is wrong, say the image was labelled as a doberman, while it in fact depicts a rottweiler. By introducing the semantic relatedness of labels, we have made the network more resilient to such errors in the training set, provided those errors are still semantically related to the true class.

3.7 Research questions

Current research shows that the introduction of hierarchy or semantic relatedness of concepts in classifiers can speed up training and improve classifier accuracy. Will the introduction of hierarchy in CNNs result in similar or other noticeable improvements?

Can a CNN be made semantically aware? If we give it some measure of how concepts are semantically related, will that lead to a network that outperforms others on semantically inspired criteria? For instance, if a CNN is not only trained on the different breeds of cats and dogs, but also on the fact that all of those labels are either a cat or a dog, will such a network correctly predict the animal even if it is not correct in its prediction of the breed?

How can we evaluate the performance of a hierarchy-aware classifier? Most benchmarks in the literature report the error on a test set. This means we cannot effectively compare results obtained on traditional benchmarks with those achieved with a hierarchy-aware classifier. I argue that the total number of errors will increase, but that the average severity of an error will decrease considerably. An effort will have to be made to relate these new results to those reported in the literature.


Chapter 4

Research Strategy

4.1 Hardware Setup

The experiments discussed in the following sections were run on an Ubuntu LTS machine at the iminds ilab.t Virtual Wall. The full specifications of the machine are given in Table 4.1.

GPU        NVIDIA GeForce GTX980
CPU        Intel Xeon E, ...-core
memory     2x Samsung 16 GB 288-pin DDR4 SDRAM
L1 cache   384 KB
L2 cache   1536 KB
L3 cache   15 MB
storage    OCZ Deneva 2 480 GB SSD

Table 4.1: Hardware specifications

The NVIDIA GeForce GTX980 has 2048 CUDA cores and was reported to be among the top-3 GPUs for deep learning in numerous benchmarks at the time. Despite the power of a single GPU, time and memory are still a bottleneck in many cases. For this reason many researchers and deep learning tools have made the transition to multi-GPU systems [29][30]. In my work, the single GPU proved sufficient.

4.2 Dataset selection

Belgian Traffic Signs

Most of my early exploration of CNNs was done on a dataset of labelled traffic signs collected from Google Streetview.

As these images are extracted from Streetview, they differ heavily in rotation, scale and lighting due to weather or time of day, time-worn paint, etc. Therefore one of the main hurdles in this dataset was the normalization of the images.

Figure 4.1: Example from the traffic sign dataset

Next to normalization, the results on this dataset improved considerably through extensive data augmentation. As already mentioned in Section 2.6, data augmentation is highly dependent on the task at hand. In the case of classifying traffic signs, horizontally mirroring all the images is very effective, while vertical mirroring effectively worsens performance. It is important to keep in mind that traffic signs were designed to be easily recognizable. It is therefore no surprise that even the simplest CNNs can reach error rates of less than 5% with proper normalization and augmentation. The dataset does not contain the required complexity to allow noticeable improvements. It is for this reason that I decided not to perform experiments with the proposed semantic cross entropy on this dataset.
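As a small illustration of this kind of augmentation (a sketch of the general idea, not the preprocessing pipeline used for these experiments), mirroring an image amounts to flipping its pixel array along one axis:

```python
import numpy as np

def augment_with_flips(images, horizontal=True, vertical=False):
    """Return the original images plus their mirrored copies.

    images: array of shape (N, height, width, channels).
    """
    augmented = [images]
    if horizontal:
        augmented.append(images[:, :, ::-1, :])   # mirror left-right
    if vertical:
        augmented.append(images[:, ::-1, :, :])   # mirror top-bottom
    return np.concatenate(augmented, axis=0)

batch = np.random.rand(8, 32, 32, 3)              # toy batch of 32x32 colour images
print(augment_with_flips(batch).shape)            # (16, 32, 32, 3)
```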

ImageNet

ImageNet is one of the most competitive datasets in computer vision. As of this writing it contains 14,197,122 high-resolution images, labelled with one or more categories. The labelling is non-exclusive and semantically structured according to WordNet, a lexical database for the English language maintained at Princeton [31]. An example is given in Figure 4.2.

Figure 4.2: Example from ImageNet. A few of its labels are Panda, Giant Panda, Mammal and Vertebrate

At first glance ImageNet is an ideal candidate for experimentation. The labels are structured semantically according to WordNet, and both the number of labels and the resolution of the images make it complex enough to allow improvement. The size of the dataset, however, proved to be too large for the scope of this thesis.

CIFAR

Collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton at Toronto University, the CIFAR dataset contains tiny images [32]. The set of images can be labelled either with 10 coarse labels or with 100 finer labels, respectively called the CIFAR-10 and CIFAR-100 datasets. All labels are mutually exclusive: a single image has only a single label. These labels are not inherently hierarchical, but as CIFAR-100 and CIFAR-10 represent respectively a fine and a coarse labelling of the same dataset, a single-step hierarchy can be found nonetheless.

Figure 4.3: Example from CIFAR, a 32x32 tiny color image labelled as dog


More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Oliver Schulte - CMPT 310 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will focus on

More information

arxiv: v1 [cs.cv] 11 May 2015 Abstract

arxiv: v1 [cs.cv] 11 May 2015 Abstract Training Deeper Convolutional Networks with Deep Supervision Liwei Wang Computer Science Dept UIUC lwang97@illinois.edu Chen-Yu Lee ECE Dept UCSD chl260@ucsd.edu Zhuowen Tu CogSci Dept UCSD ztu0@ucsd.edu

More information

CS 1674: Intro to Computer Vision. Final Review. Prof. Adriana Kovashka University of Pittsburgh December 7, 2016

CS 1674: Intro to Computer Vision. Final Review. Prof. Adriana Kovashka University of Pittsburgh December 7, 2016 CS 1674: Intro to Computer Vision Final Review Prof. Adriana Kovashka University of Pittsburgh December 7, 2016 Final info Format: multiple-choice, true/false, fill in the blank, short answers, apply an

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

Regression Adjustment with Artificial Neural Networks

Regression Adjustment with Artificial Neural Networks Regression Adjustment with Artificial Neural Networks Age of Big Data: data comes in a rate and in a variety of types that exceed our ability to analyse it Texts, image, speech, video Real motivation:

More information

Deep Feedforward Networks. Sargur N. Srihari

Deep Feedforward Networks. Sargur N. Srihari Deep Feedforward Networks Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

RAGAV VENKATESAN VIJETHA GATUPALLI BAOXIN LI NEURAL DATASET GENERALITY

RAGAV VENKATESAN VIJETHA GATUPALLI BAOXIN LI NEURAL DATASET GENERALITY RAGAV VENKATESAN VIJETHA GATUPALLI BAOXIN LI NEURAL DATASET GENERALITY SIFT HOG ALL ABOUT THE FEATURES DAISY GABOR AlexNet GoogleNet CONVOLUTIONAL NEURAL NETWORKS VGG-19 ResNet FEATURES COMES FROM DATA

More information

Lecture 2: Learning with neural networks

Lecture 2: Learning with neural networks Lecture 2: Learning with neural networks Deep Learning @ UvA LEARNING WITH NEURAL NETWORKS - PAGE 1 Lecture Overview o Machine Learning Paradigm for Neural Networks o The Backpropagation algorithm for

More information

Automatic Differentiation and Neural Networks

Automatic Differentiation and Neural Networks Statistical Machine Learning Notes 7 Automatic Differentiation and Neural Networks Instructor: Justin Domke 1 Introduction The name neural network is sometimes used to refer to many things (e.g. Hopfield

More information

CSC321 Lecture 16: ResNets and Attention

CSC321 Lecture 16: ResNets and Attention CSC321 Lecture 16: ResNets and Attention Roger Grosse Roger Grosse CSC321 Lecture 16: ResNets and Attention 1 / 24 Overview Two topics for today: Topic 1: Deep Residual Networks (ResNets) This is the state-of-the

More information

Machine Learning Lecture 14

Machine Learning Lecture 14 Machine Learning Lecture 14 Tricks of the Trade 07.12.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory Probability

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

CSC 411 Lecture 10: Neural Networks

CSC 411 Lecture 10: Neural Networks CSC 411 Lecture 10: Neural Networks Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 10-Neural Networks 1 / 35 Inspiration: The Brain Our brain has 10 11

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

arxiv: v2 [cs.ne] 22 Feb 2013

arxiv: v2 [cs.ne] 22 Feb 2013 Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA

More information

Is Robustness the Cost of Accuracy? A Comprehensive Study on the Robustness of 18 Deep Image Classification Models

Is Robustness the Cost of Accuracy? A Comprehensive Study on the Robustness of 18 Deep Image Classification Models Is Robustness the Cost of Accuracy? A Comprehensive Study on the Robustness of 18 Deep Image Classification Models Dong Su 1*, Huan Zhang 2*, Hongge Chen 3, Jinfeng Yi 4, Pin-Yu Chen 1, and Yupeng Gao

More information

Introduction to Deep Learning

Introduction to Deep Learning Introduction to Deep Learning A. G. Schwing & S. Fidler University of Toronto, 2015 A. G. Schwing & S. Fidler (UofT) CSC420: Intro to Image Understanding 2015 1 / 39 Outline 1 Universality of Neural Networks

More information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Mathias Berglund, Tapani Raiko, and KyungHyun Cho Department of Information and Computer Science Aalto University

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Thesis. Wei Tang. 1 Abstract 3. 3 Experiment Background The Large Hadron Collider The ATLAS Detector... 4

Thesis. Wei Tang. 1 Abstract 3. 3 Experiment Background The Large Hadron Collider The ATLAS Detector... 4 Thesis Wei Tang Contents 1 Abstract 3 2 Introduction 3 3 Experiment Background 4 3.1 The Large Hadron Collider........................... 4 3.2 The ATLAS Detector.............................. 4 4 Search

More information

P-TELU : Parametric Tan Hyperbolic Linear Unit Activation for Deep Neural Networks

P-TELU : Parametric Tan Hyperbolic Linear Unit Activation for Deep Neural Networks P-TELU : Parametric Tan Hyperbolic Linear Unit Activation for Deep Neural Networks Rahul Duggal rahulduggal2608@gmail.com Anubha Gupta anubha@iiitd.ac.in SBILab (http://sbilab.iiitd.edu.in/) Deptt. of

More information

Spatial Transformer Networks

Spatial Transformer Networks BIL722 - Deep Learning for Computer Vision Spatial Transformer Networks Max Jaderberg Andrew Zisserman Karen Simonyan Koray Kavukcuoglu Contents Introduction to Spatial Transformers Related Works Spatial

More information